[HN Gopher] The Fastest FizzBuzz Implementation
       ___________________________________________________________________
        
       The Fastest FizzBuzz Implementation
        
       Author : Croftengea
       Score  : 248 points
       Date   : 2021-12-02 08:04 UTC (14 hours ago)
        
 (HTM) web link (tech.marksblogg.com)
 (TXT) w3m dump (tech.marksblogg.com)
        
       | DeathArrow wrote:
        | It seems there's a huge gap between the machine code compilers
        | can generate and hand-written, optimized assembly. There's a
        | lot of room left for compilers to improve.
        
         | bawolff wrote:
          | Most programs aren't fizzbuzz. This is basically hand-
          | optimizing a microbenchmark. While I'm sure compilers can
          | always be better, I'm not sure this particular example
          | generalizes.
        
           | londons_explore wrote:
           | But hand optimizing a few microbenchmarks like this tells us
           | how far we might be able to improve compilers.
           | 
           | And from the looks of it, at least 100x is possible. It's a
           | shame therefore that even a 1% performance increase in a
           | compiler release is considered pretty noteworthy.
        
             | anthony_r wrote:
             | This is a very specific problem that lends itself to some
             | fantastically cpu-friendly optimizations. Doing a single
             | hash-map lookup here on the hot path would kill
             | performance, and let's not even get started on any serious
             | I/O (disk or network) as that would be multiple orders of
             | magnitude slower. Not many problems are just "cat something
             | extremely predictable into /dev/null".
        
               | londons_explore wrote:
               | But nearly all problems boil down to nearly zero actual
               | work...
               | 
               | Take booting up a computer for example. All that
               | _actually_ needs to be done is to display the login
               | screen on the screen. And we know from 60 fps video
               | playback, that the hardware is capable of displaying a
               | frame inside 16 milliseconds.
        
               | anthony_r wrote:
               | You can create an operating system that just displays a
               | static image within milliseconds yourself (shouldn't take
               | you more than a few weeks), it would work even when
               | compiled at -O0.
        
               | samhw wrote:
               | And so what? That's what this code does, and so a
               | compiler should be able to generate the optimal assembly.
               | Why would a compiler need to consider something that the
               | code in question is _not doing_?
        
               | bawolff wrote:
               | How would the compiler know that? This code for example
               | is only optimal if you are piping it somewhere instead of
               | displaying on the terminal. There is no way, even in
               | principle, for the compiler to know that this program is
               | always run with stdout connected to a pipe.
               | 
               | Moreover, if you mean optimal in the absolute sense,
               | pretty sure that is equivalent to solving the halting
               | problem.
               | 
                | If what you're asking is: why can't computers take all
                | surrounding fuzzy context into account and do precisely
                | the right thing, the answer is: because we haven't
                | invented strong AI yet.
        
               | samhw wrote:
               | Ah, sorry, I misunderstood you. I thought you were
               | placing the emphasis on what the code was doing, rather
               | than where it was being piped. But why do you think that
               | most of the possible optimisations would only be relevant
               | if the output were being piped to somewhere other than a
               | terminal?
               | 
               | Also, why would making it absolutely optimal be
               | "equivalent to solving the halting problem"? (Though
               | either way, no, I'm not talking about producing 100.0000%
               | optimal code - clearly there's plenty of daylight between
               | the current degree of optimality and that one.)
        
               | bawolff wrote:
               | > Ah, sorry, I misunderstood you. I thought you were
               | placing the emphasis on what the code was doing, rather
               | than where it was being piped. But why do you think that
               | most of the possible optimisations would only be relevant
               | if the output were being piped to somewhere other than a
               | terminal?
               | 
               | One of the primary optimizations is using vm_splice to
               | avoid memory copies. vm_splice only works if the file
               | descriptor is a pipe.
               | 
               | > Also, why would making it absolutely optimal be
               | "equivalent to solving the halting problem"?
               | 
               | https://en.m.wikipedia.org/wiki/Rice%27s_theorem
               | 
                | Or more specifically, if a program always halts (without
                | outputting things) then the best optimization would be
                | to have it halt immediately, but to know that you have
                | to have solved the halting problem.
               | 
                | You're right of course, absolutes are a bit of a
                | distraction; we just need something good for 99% of
                | programs. However, that's still really hard, and an
                | optimization applied in the wrong context will slow down
                | a program. There's still lots of room for compilers to
                | grow, but determining whether the kinds of optimizations
                | used in this example apply is probably at best
                | NP-complete (I don't have any evidence for that,
                | though).
        
               | anthony_r wrote:
               | Yeah you could optimize GCC to work really well for those
               | problems, there's a great ROI for the "write something
               | trivial really quickly into /dev/null" class of problems.
               | 
               | Let us know how this goes if you embark on this
               | adventure.
        
               | samhw wrote:
               | If you genuinely understood me as saying that, then I
               | should clarify, to be absolutely foolproof:
               | 
               | That is not what I'm saying. I'm talking about having a
               | compiler which could produce more optimal code in any
               | given case, or most given cases. I obviously am not
               | suggesting that compilers optimise this and only this
               | case.
        
               | anthony_r wrote:
               | > I'm talking about having a compiler which could produce
               | more optimal code in any given case, or most given cases
               | 
               | Funny coincidence, that's exactly what the vast majority
               | of compiler code is for, with tens (or hundreds) of
               | thousands of man-years of work spent on that.
        
         | bonzini wrote:
         | The top 3 entries use completely different algorithms. The use
         | of assembly might help a bit, but not as much as the raw
         | numbers suggest.
        
         | andrewaylett wrote:
         | Back when I was employed to work on a compiler, our approach to
         | such things was to implement our run-time libraries in C and
         | fix the compiler so it was able to produce the desired output.
         | 
         | In this case, though, the biggest differences aren't coming
         | from C vs assembler but from walking away from platform
         | conventions -- ABI, memory layout, CPU compatibility. It might
         | be easier to write assembly than to teach the compiler how to
         | generate code that does this, but that's largely because you
         | don't need these optimisations in most cases. If the methods
         | used provided a general speedup, it would be worth implementing
         | them in compilers.
        
         | btilly wrote:
         | I suspect that a compiler which produced code that will crash
         | when you attempt to write to a file would be declared buggy and
         | unusable.
         | 
         | And yet, that is one of the optimizations that helps get this
         | one the speed crown.
        
         | pjc50 wrote:
         | Producing this solution requires not merely a clever compiler
         | but an extraordinary amount of creative intelligence.
        
       | recentdarkness wrote:
        | Interesting to me was the fact that the author talks about
        | themselves in the third person. Very unusual for content like
        | this :D
        
         | lifthrasiir wrote:
          | The author of this post (Mark Litwintschik) is not the same
          | as the author of the assembly solution (Alex Smith, also
          | known as ais523) :-)
        
           | nickdothutton wrote:
            | This chap is the developer. Kudos to him.
            | https://www.wolframscience.com/prizes/tm23/alex_smith_bio.ht...
        
         | Croftengea wrote:
          | > The developer behind the Assembler version goes by the
          | > handle "ais523". I've been unable to uncover this person's
          | > real-life identity.
        
       | lifthrasiir wrote:
       | > The developer behind the Assembler version goes by the handle
       | "ais523".
       | 
       | Alex Smith [1] is also known for winning the Wolfram 2,3 Turing
       | Machine Research Prize [2].
       | 
       | [1] https://www.cs.bham.ac.uk/~ais523/
       | 
       | [2] https://writings.stephenwolfram.com/2007/10/the-prize-is-
       | won...
        
       | jprupp wrote:
        | Hold my beer. Now where is that OpenCL programming book when I
        | need it?
        
       | bXVsbGVy wrote:
        | How did they test it? Is the data actually being delivered, and
        | read, in order?
        
       | w0mbat wrote:
       | I played with FizzBuzz when this was mentioned here a month ago,
       | focusing on the algorithm not optimizations.
       | 
        | I stopped once I could do it in simple code with no math except
        | addition, 15 numbers at a time.
        | 
        |     for (int x = 0; x < 1000000; x += 15) {
        |         printf("%d\n%d\nfizz\n%d\nbuzz\nfizz\n%d\n%d\nfizz\n"
        |                "buzz\n%d\nfizz\n%d\n%d\nfizzbuzz\n",
        |                1 + x, 2 + x, 4 + x, 7 + x, 8 + x, 11 + x,
        |                13 + x, 14 + x);
        |     }
       | 
       | That is simple enough and fast enough for me.
        
         | WithinReason wrote:
         | Did you check what the throughput of this is?
        
       | richardwhiuk wrote:
       | > Before running the code, make sure your CPU does support AVX2.
       | ... You should see "sse2" at a minimum.
       | 
       | AVX2 is not SSE2 - SSE2 is much, much older.
       | 
       | You want to check for the avx2 flag itself.
        
       | GnarfGnarf wrote:
       | _581 lines of Assembler_
       | 
       | Interesting, but not very practical.
        
         | chii wrote:
         | is practicality really relevant to a competition to write the
         | fastest FizzBuzz implementation?
        
       | hackater wrote:
       | How do you learn about this? Does anyone have any good
       | recommendation on books, where you learn how to do these kind of
       | very low level optimizations?
        
       | Maursault wrote:
       | No machine code implementation? I wonder how much faster than
       | assembler it could be.
        
         | pxeger1 wrote:
         | Assembler has basically no abstractions on top of machine code;
         | just a human-readable syntax. So there would be no difference.
        
       | pxeger1 wrote:
       | This looks like a pretty rubbish blogspam regurgitation of the
       | original SE answer.
        
       | atorodius wrote:
       | Relevant: https://news.ycombinator.com/item?id=29031488
        
         | dang wrote:
         | Thanks! Macroexpanded:
         | 
         |  _55 GiB /s FizzBuzz_ -
         | https://news.ycombinator.com/item?id=29031488 - Oct 2021 (255
         | comments)
        
           | throw10920 wrote:
           | If I may ask, what's the default keybinding for _dang-
           | macroexpand-1_?
        
             | dang wrote:
              | ctrl+alt+b
        
       | gumby wrote:
       | Pfft. No machine learning was used?
        
       | trevyn wrote:
       | I wonder how an M1 Max assembly version would fare.
        
         | Croftengea wrote:
         | ARM64 SIMD instructions perform in the same ballpark, see this:
         | https://news.ycombinator.com/item?id=25408853
        
         | SekstiNi wrote:
          | The Firestorm cores are capable of issuing two 128-bit stores
          | per cycle [1], giving a bandwidth of 102.4 GBps. This matches
         | the experiments by Andrei with mixed reads/writes [2], and I
         | have personally verified that it is attainable for pure write
         | workloads to L1 as well as L2.
         | 
         | This does not mean 102.4 GBps is achievable on a single core,
         | but if L2 bandwidth is the limiting factor it very well could
         | be.
         | 
         | [1] https://dougallj.github.io/applecpu/firestorm.html [2]
         | https://www.anandtech.com/show/17024/apple-m1-max-performanc...
        
         | faeyanpiraat wrote:
         | Before running the code, make sure your CPU does support AVX2.
         | Most 64-bit Intel and AMD CPUs should. ARM CPUs, like those
         | found in newer Apple computers, Smartphones or Raspberry Pis,
         | won't support AVX2.
        
           | comonoid wrote:
            | ARM CPUs have NEON. I can't say whether NEON has equivalent
            | instructions for everything used here; in any case, the
            | thread starter is talking about a complete rewrite, not
            | just running it as-is.
        
             | innocenat wrote:
                | NEON only has 128-bit registers. AVX has 256-bit
                | registers.
        
               | comonoid wrote:
                | Yep, it will most likely require reorganizing the
                | program, but by itself that doesn't necessarily mean
                | the result will be slower.
        
           | Tepix wrote:
           | That sentence seems silly. ARM assembly and x86-64 assembly
           | are two different beasts. Running the x86-64 code using
           | Rosetta 2 emulation seems counter-intuitive as well if you
           | want to get the fastest possible result on that machine.
        
       | buro9 wrote:
        | The simplest change for performance, even in Python, JavaScript,
        | etc., is to avoid the modulo operation.
       | 
       | All of the top contenders on
       | https://codegolf.stackexchange.com/questions/215216/high-thr...
       | do exactly this.
       | 
        | Yes, there is a lot of optimization specific to languages, OS,
        | and CPU... but the takeaway for me isn't that you have to go to
        | that extent to get an order-of-magnitude improvement. You can
        | still achieve orders of magnitude of improvement just by
        | understanding what happens when you use something like the
        | modulo operation, and by recognizing when your requirements let
        | you avoid it - that's the win.
       | 
        | For FizzBuzz you are told it "iterates from 1 to <some
        | number>"... and modulo would be perfect for "take a single
        | number and produce Fizz, Buzz or FizzBuzz if it's divisible by
        | 3, 5, or both 3 and 5"... but as we're iterating, counters are
        | better suited to the task.
       | 
       | I love seeing epic code golf (and the solution outlined in the
       | article is it), but the takeaway for the majority of engineers is
       | that significant improvements can be made with a readable and
       | maintainable solution just by understanding the requirements and
       | comp sci fundamentals.
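The counter-based approach described above can be sketched as follows (an illustrative Python sketch, not code from any of the submissions; the function name is my own):

```python
def fizzbuzz_counters(n):
    """Generate FizzBuzz lines 1..n using countdown counters
    instead of the modulo operator."""
    out = []
    fizz, buzz = 3, 5  # countdowns to the next multiple of 3 and 5
    for i in range(1, n + 1):
        fizz -= 1
        buzz -= 1
        line = ""
        if fizz == 0:
            line += "Fizz"
            fizz = 3  # reset the counter instead of computing i % 3
        if buzz == 0:
            line += "Buzz"
            buzz = 5
        out.append(line or str(i))
    return out

assert fizzbuzz_counters(5) == ["1", "2", "Fizz", "4", "Buzz"]
```

Each counter is reset when it reaches zero, so the hot loop contains only decrements and comparisons; no division or remainder ever runs.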
        
         | jiggawatts wrote:
          | The _really_ heavyweight use of div/mod is in producing the
          | base-10 printed decimal numbers, not in the main control loop.
         | 
         | The record-beating versions all do something clever to optimise
         | the printing of the numbers between the "Fizz" and "Buzz"
         | strings.
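One way the fast submissions sidestep div/mod when printing is to keep the current number as an ASCII decimal buffer and increment it in place. The real entries do this over wide byte vectors; this is a hedged, scalar Python sketch of the idea (function name my own):

```python
def increment_ascii(digits: bytearray) -> None:
    """Add 1 to a decimal number stored as ASCII bytes, in place.
    Avoids int-to-string conversion, and thus division by 10."""
    i = len(digits) - 1
    while i >= 0:
        if digits[i] < ord("9"):
            digits[i] += 1
            return
        digits[i] = ord("0")  # carry: '9' -> '0', continue leftwards
        i -= 1
    # every digit was '9': grow the buffer, e.g. 999 -> 1000
    digits[:0] = b"1"

n = bytearray(b"998")
increment_ascii(n)  # n is now b"999"
increment_ascii(n)  # n is now b"1000"
```

Since FizzBuzz emits consecutive integers, the carry loop almost never runs past the last digit, which is what makes this cheaper than a general-purpose itoa.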
        
       | solmag wrote:
        | I think the original author said that it was harder to make this
        | than his master's thesis.
        
       | parhamn wrote:
       | When I saw this a few weeks ago it set off a lot of thinking
       | about compilers and ML.
       | 
       | Even the unoptimized C version is 10x slower than the one that is
       | painstakingly optimized.
       | 
        | Some of this performance gain will certainly be achievable by
        | ML/AI-assisted compilers and translators. In that case, even a
        | 2x speedup would be game-changing. Is anyone using ML in
        | compilers yet? The part where you verify the code still does
        | what you wanted it to do seems trickiest.
        
         | m3at wrote:
         | > Is anyone using ML in compilers yet?
         | 
         | Yes :)
         | 
         | Check out TVM for something mainstream: http://tvm.apache.org/
         | 
         | And LibNC for something esoteric (but elegant):
         | https://bellard.org/libnc/
        
       | andi999 wrote:
        | I don't understand the FizzBuzz thingy. I mean, isn't it just
        | about who has the fastest print method? Correct me please, but
        | this problem has period 15, so if you just write:
        | 
        |     for (i = 1; i < n; i += 15) {
        |         print i; print i+1; print 'fizz'; print i+3; ...
        |     }
        | 
        | you have the fastest, don't you? (If the ending n is important
        | then one can just write 15 such loops/endgames depending on the
        | modulus.) Since there is no 'if', the branch predictor should
        | be on your side.
        
         | zkldi wrote:
         | Did you even read the OP?
        
           | andi999 wrote:
           | The OP is great. And basically it is about how to print
           | faster. I was more commenting on the fast fizzbuzz challenge
           | itself.
        
         | Tepix wrote:
          | You're printing the number too, so no, it does not have a
          | period of 15.
        
       | isodev wrote:
        | Come on now, are we really using fizzbuzz as a performance
        | benchmark? Seriously, don't base real-world decisions on this.
        | 
        | PS. The Rust implementation is also not very optimal; for
        | example, stdout is not locked, so... what are we even trying
        | to show :-).
        
         | tourist2d wrote:
         | Can you really not tell the difference between a fun challenge
         | and a real world scenario?
        
       | sshb wrote:
        | I wonder about the performance of FPGA and io_uring
        | implementations.
        
       | Terretta wrote:
       | YAWFB: Yet another wrong fizzbuzz.
       | 
       | "fizz" % 3, "buzz" % 5, what is the % 15 for?
       | 
        | You can blow an interviewee's mind by saying: great, now "bang"
        | % 7... and (hopefully!) watch the light dawn.
       | 
       | // why so serious? :-)
        
         | LeifCarrotson wrote:
         | The % 15 is because they're using 'elif' or 'else if'; they
         | only process each entry once and they exit the function when a
         | successful comparison is made.
         | 
          | It's true that you could do this with something like:
          | 
          |     result = ''
          |     if x % 3 == 0: result += 'fizz'
          |     if x % 5 == 0: result += 'buzz'
          |     if result == '': result += str(x)
          |     return result
          | 
          | to append fizz for 15 and also continue processing and append
          | buzz, but this requires an extra comparison to check that you
          | did neither of those before printing the number.
        
           | Terretta wrote:
           | we all know what the %15 is for.
           | 
           | but note that if you do 3 comparisons, or 2 comparisons and
           | an "extra" comparison, you've done the same number of
           | comparisons, it's not really 'extra'
           | 
           | then go ahead and keep adding factors and see which results
           | in more comparisons as we rack up the actually extra
           | comparisons for various combinations of factors
           | 
           | btw and ftw, your modulo and string test example, more
           | correct than the usual implementation, is among the most
           | efficient:
           | 
            | http://www.zoharbabin.com/which-fizzbuzz-solution-is-the-mos...
           | 
           | // still not so serious...
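The "now add 'bang' % 7" extension discussed above falls out naturally from the append-style formulation, since each divisor is tested independently and no combined cases (%15, %21, %105...) are needed. A hedged Python sketch (the divisor list and function name are my own illustration):

```python
def fizzbuzz_general(x, rules=((3, "fizz"), (5, "buzz"), (7, "bang"))):
    """Append a word for every divisor of x that matches; fall back
    to the number itself. Adding a factor adds exactly one test."""
    result = "".join(word for div, word in rules if x % div == 0)
    return result or str(x)

# One comparison per factor, never one per combination of factors.
print(fizzbuzz_general(21))   # fizzbang
print(fizzbuzz_general(105))  # fizzbuzzbang
```

With k factors this does k comparisons per number, while the elif-chain style needs a branch for every combination of factors, which grows much faster than k.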
        
       | javier10e6 wrote:
        | Asm 56 GB/s (you are reading Sanskrit)
        | C 41 GB/s (you are reading US Tax Code)
        | Rust 30 GB/s (you are reading English)
        | Python (why bother) (you are speed reading)
        | 
        | So many tools, so many languages, life is great.
        
       | opentokix wrote:
        | Imagine writing this asm implementation for the 22 y/o Google
        | recruiter and failing the interview. :D
        
       | DeathArrow wrote:
        | The most interesting thing is that the C version is almost as
        | fast as the assembler version.
        
         | bonzini wrote:
         | The most interesting thing is that it took one day to write the
         | C version and several months for the assembly version, because
         | the C version uses a simpler algorithm. Talk about diminishing
         | returns. :)
        
           | pokepim wrote:
           | And you could copy paste Python version from SO and have it
           | running in less than a minute, so here's the winner :)
        
             | bonzini wrote:
             | If the Python version is 2000 times slower, the breakeven
             | would be after a few terabytes of FizzBuzz.
        
               | Qem wrote:
               | The Python version posted is about 85 MiB/sec, so only
               | 600 times slower than the fastest one.
        
               | bonzini wrote:
               | Ah, forgot about pypy!
        
       | sillysaurusx wrote:
       | > As of this writing, the 3rd fastest submission is written in
       | Rust and produces out at a rate of 3 GB/s, 2nd is written in C
       | and produces at a rate of 41 GB/s and the fastest is written in
       | Assembler and produces at a rate of 56 GB/s.
       | 
       | This makes me happy. The order is exactly as it should be, much
       | to the disappointment of the Rustaceans.
       | 
       | It's a nice reminder that for the ultimate performance, there is
       | no substitute for knowing exactly what the assembler is doing.
       | 
       | It's also a nice reminder that most of us will never want to do
       | that.
       | 
       | EDIT: Whoa. It's late, and I thought Rust was at 30 GB/s instead
       | of 3, i.e. roughly neck and neck with C. I meant to make a good-
       | natured joke. There must be something wrong with the Rust
       | implementation to be 10x slower; I wonder what it is.
       | 
       | As I mentioned further down in the comments, I'll donate $500 to
       | Watsi if someone manages to displace the C implementation in Rust
       | or any other language besides C/C++/ASM. Be sure to DM me on
       | Twitter (https://twitter.com/theshawwn) if you do, both so I'll
       | see it and so that I can congratulate you. I'm sure it's
       | possible, but it's sort of interesting that the activation energy
       | is so high.
        
         | lawn wrote:
         | So you think that it's expected that C is 10 times faster than
         | Rust?
        
           | sillysaurusx wrote:
           | Wow. Perhaps I should be embarrassed, but I mentally filled
           | in "30 GB/s" instead of 3 GB/s. I thought Rust was mostly
           | neck and neck with C. I see now why a lot of fellows
           | responded in a defensive way.
           | 
           | What's it doing to be so much slower? That's shocking. I may
           | not _enjoy_ Rust, but I do respect its ability to approach
           | C++ performance.
        
             | IshKebab wrote:
             | > That's shocking.
             | 
             | Not really. This is a hyper optimised microbenchmark. The
             | actual work being done is so trivial that any "overhead" is
             | really significant and you can get order of magnitude
             | speedups by doing arcane things with caches, eliminating
             | tiny inefficiencies and so on.
        
               | xxpor wrote:
               | Complete speculation, but I think you'd find that's true
               | for most problems. Very little CPU time is spent on the
               | business logic, and the rest is overhead.
        
             | misja111 wrote:
             | As far as I understand, that #2 C program contains hardly
             | any C code but mostly inlined assembly.
        
               | bonzini wrote:
               | The program itself is 100% C, but it spends 99.9999% of
               | the time in assembly code that is generated at runtime.
        
             | friedman23 wrote:
             | I'm curious, how did rust affect you so badly that you hate
             | it so much? If it's just silly people suggesting it
             | everywhere I think you have some personal issues you need
             | to resolve, language choice isn't something to get
             | emotional about.
        
               | [deleted]
        
               | ask_b123 wrote:
               | I've not seen any indication that sillysaurusx _hates_
               | Rust.
        
             | [deleted]
        
             | Arnavion wrote:
             | At the very least, the itoap crate being used there does
             | divisions by and comparisons to powers of 10 as part of
             | stringifying integers, which the C and ASM versions don't
             | seem to be doing. A crate with a function for stringifying
             | arbitrary integers cannot take advantage of the fact that
             | it's going to be used for stringifying consecutive
             | integers, as an inlined solution can.
        
         | GrayShade wrote:
         | You do seem to have an axe to grind with the people using Rust.
         | 
         | I don't see any reason why the same thing [1] could not be
         | implemented in Rust.
         | 
         | EDIT: I got caught by an off-by-one there, the second fastest
         | solution (doing JIT compilation) is actually [2].
         | 
         | [1]: https://codegolf.stackexchange.com/a/215231
         | 
         | [2]: https://codegolf.stackexchange.com/a/236737
        
           | sillysaurusx wrote:
           | Much like speedruns, it's up to you to prove it :)
           | 
           | I just enjoy dosing the Rusties with reality every so often.
           | 
           | If you write a Rust version that can beat the C version, I'll
           | donate $500 to Watsi. Good luck!
        
             | gerbilly wrote:
             | Ultimately, your argument is a troll because you are trying
             | to tell people what they should value in a programming
             | language.
        
               | bawolff wrote:
               | Pretty sure the fact its literally a speed contest is
               | telling people what to value in a programming language.
        
             | KingOfCoders wrote:
             | "I just enjoy dosing the Rusties with reality every so
             | often."
             | 
             | Aka troll.
        
               | AnIdiotOnTheNet wrote:
               | Not necessarily. Language evangelists can be incredibly
               | annoying... actually, evangelists in general can be
               | really annoying. It is cathartic to prove annoying people
               | wrong every once in a while.
        
               | Ygg2 wrote:
                | OTOH, being a dick to the majority of people because of
                | the acts of a minority is worse.
        
               | GrayShade wrote:
               | I don't know. I use Rust myself, but I've seen more
               | people complaining about Rust evangelism than people
               | actually doing that.
               | 
               | The RESF is a bad meme.
        
               | colejohnson66 wrote:
               | When there's a memory related bug (or CVE) in a program
               | written in C or C++, there's almost always some comment
               | about how the Rust compiler would've caught it
        
               | KingOfCoders wrote:
                | From Rust developers or from app users? As an app user
                | I would prefer apps to be written in Rust, and I'm
                | reminded of that with every CVE for a C program.
        
               | tialaramex wrote:
               | On HN there is often: A comment about Rust would catch
               | this (sometimes sarcastic), an explanation about how the
               | bug isn't "really" memory safety and so Rust wouldn't
               | catch it, followed by a comment explaining actually Rust
               | does catch it.
               | 
               | The recent Linux mutex security bug is an example. That's
               | not a memory safety bug. Surely Rust doesn't help? But
               | Rust's mutual exclusion design relies on the Rust type
               | system to give us safer locks, in Rust you protect things
               | by having "lock the mutex" be the way you get the
               | reference to the protected object. Thus when the faulty
                | code locks the wrong mutex, in Rust it would get the
                | wrong object, and either _not compile_ (you lock
                | variable A and then you're trying to change B, but you
                | don't even have B) or _not work_ (you always lock
                | variable A, then change variable A, but your code
                | should change B; it does not work, a trivially
                | reproducible bug in your code)
        
               | sillysaurusx wrote:
               | Oh? Why is it trolling to say that Rust isn't as fast as
               | C in this case?
               | 
               | If it were, you'd claim the prize. If you really want to
               | hurt a troll, wouldn't you try to separate them from
               | their $500?
        
               | KingOfCoders wrote:
               | You still assume I care. As stated above I don't care.
               | It's in your head only that every Rust developer writes
               | Rust because she thinks it's faster than C.
        
               | detaro wrote:
                | > _to say that Rust isn't as fast as C in this case_
               | 
               | That's not what the quoted statement is doing.
        
               | sillysaurusx wrote:
               | I think we'll have to agree to disagree. It's
               | illustrative that a seemingly simple C program can't be
               | easily ported to Rust. If it could be, someone would have
               | already done it -- it's been a few hours.
               | 
               | It's also illustrative to examine exactly what the
               | hangups are. You might even get more Rust converts if you
               | acknowledge that there are serious engineering burdens to
               | overcome when deciding to use Rust.
        
               | [deleted]
        
               | samhw wrote:
               | For the record, it's not the anti-Rust sentiment which
               | people find annoying, it's the treatment of programming
               | languages as though they were football teams.
               | 
               | If you want to make a case against Rust, then let's get
               | concrete: _why_ is one Turing-complete compiled language
               | that can make arbitrary syscalls not capable of compiling
               | down to the same assembly as another?
               | 
               |  _That_ would be an interesting discussion; this is not.
        
               | Retric wrote:
               | You can dig into it, but the details are mostly only
               | relevant to people designing computer languages and
               | tooling.
               | 
               | Aka Compilers, as long as the machine code is equivalent
               | to the source code their output can be wildly different.
               | However, someone needs to actually write them, which
               | means both linguistic differences and the sheer effort
               | that was actually put into the compiler are significant.
               | You shouldn't assume that every language is operating on
               | an even playing field, because they're simply not. The
               | obvious solution of compiling to a second language is
               | never quite as capable as using the language directly.
        
               | samhw wrote:
               | > Compilers, as long as the machine code is equivalent to
               | the source code their output can be wildly different
               | 
               | I don't understand what this means: the output of a
               | compiler _is_ the machine code from the source code, no?
               | 
               | Also, Rust uses LLVM as a backend, which has benefited
               | from lots of work given its use for C, C++, ObjC, Swift,
               | &c. Compiling the C with Clang/LLVM might give a more
               | interesting insight into the differences between Rust and
               | C.
               | 
               | > The obvious solution of compiling to a second language
               | is never quite as capable as using the language directly.
               | 
               | I'm not sure what you mean by this, or rather how it's
               | related to what we're talking about?
        
               | Retric wrote:
               | > The output is the machine code.
               | 
               | You get different machine code using different compilers
               | on identical pieces of source code. In theory the results
               | of running that source code should have equivalent
               | results even if one is more efficient than another. If
               | for example one compiler uses an ASM instruction that
               | another never uses then nothing the programmer does can
               | get the second compiler to be as efficient as the first.
               | The same is true of various optimizations etc, sometimes
               | a compiler is just better.
               | 
               | Bring a different language into the picture and things
               | get even more divergent.
               | 
               | > how it's related
               | 
               | It's specifically in reference to Rust's use of LLVM not
               | being sufficient to guarantee equivalence. Many languages
               | target the JVM, that does not make them equivalent.
        
               | samhw wrote:
               | > You get different machine code using different
               | compilers on identical pieces of source code.
               | 
               | Ah, right - well I'm certainly not denying that. (I'm
               | familiar with the topic, to be clear: I've written
               | compilers. I just wasn't clear what exactly you were
               | saying.)
               | 
               | > It's specifically in reference to rust use of LLVM not
               | being sufficient to guarantee equivalence.
               | 
               | Formally, no. In practice, given that fact, given the
               | experience of the maintainers of both languages, and
               | given the 'compilability' of both languages, it would be
               | surprising if there were a 10x gulf between them.
        
               | taneq wrote:
               | That kind of behaviour (edit: theirs, not yours) is
               | usually ultimately motivated by fear. ;)
        
             | trevyn wrote:
             | No no, C is great for writing fast FizzBuzz
             | implementations; any larger program is hard to get correct.
             | :-)
        
         | KingOfCoders wrote:
         | "disappointment of the Rustaceans."
         | 
         | I'm using Rust and I'm not disappointed. Why should I be?
         | 
         | But I do appreciate how you create imaginary people in your
         | head, troll them, and mentally pat yourself on the back for
         | doing so.
        
           | isoprophlex wrote:
           | All the rust users rising to the bait in this thread sure
           | aren't imaginary
        
             | [deleted]
        
             | KingOfCoders wrote:
              | Which comment comes to mind from a Rust developer who
              | thinks Rust is faster than C and needs a dose of reality?
        
               | colejohnson66 wrote:
               | Rust is a memory-safe _systems_ programming language. As
               | a systems programming language, it can be used in the
               | same areas that C can (see Redox OS). I've seen quite a
               | few Rust users online who have claimed that, because of
               | that, it can be just as fast as C.
        
               | tialaramex wrote:
               | But it can be _just as fast_ as C. The C program that's
               | so much faster here isn't "really" a C program; it's
               | just a fancy way to shovel machine code, which you could
               | of course also do (unsafely) in Rust if you chose to,
               | but why?
               | 
               | However, what you were asked about is claims that it's
               | _faster_, and there micro-benchmarks like this are not
               | revealing. At scale you're obliged to limit yourself to
               | what you can write _that works_, and in Rust the safety
               | lets you reach higher - which is how it achieves better
               | performance. In fact that's what Mozilla used Rust for.
               | They'd failed more than once to ship a concurrent CSS
               | speed-up, finding it too hard to successfully get the
               | bugs out and so returning to a single-threaded CSS
               | implementation - but the attempt in Rust (Quantum)
               | worked just fine.
               | 
               | This is the same reason C++ is faster: a C++ iterator
               | isn't _faster_ than the naive C for loop - in fact they
               | emit identical code - but the iterator is easier to
               | think about, which means you can confidently write more
               | complicated software using this abstraction. Rust just
               | provides the safety to go higher still.
        
               | KingOfCoders wrote:
               | What is the opinion of Golang users on this?
        
               | politician wrote:
               | This is really cool. It should be obvious that a few
               | thousand hand-rolled assembly opcodes targeting a fixed
               | architecture would win the race. If the size of the
               | solution grew a couple of orders of magnitude larger
               | though, I'd expect that the micro optimization challenge
               | would overwhelm the author and compilers would be more
               | competitive.
               | 
               | Is the Table of Contents of your upcoming book public?
        
               | KingOfCoders wrote:
               | No, not yet - incorporating first-draft feedback.
               | 
               | You made my evening for asking <3
        
             | [deleted]
        
         | tester34 wrote:
         | >This makes me happy. The order is exactly as it should be,
         | much to the disappointment of the Rustaceans.
         | 
         | Why does it make you happy? It's a reminder of how much
         | performance, energy, and so on we're losing when we write in
         | safe/expressive/easy-to-use languages like Java, C#, Python,
         | yada yada.
        
         | Arnavion wrote:
         | >It's a nice reminder that for the ultimate performance, there
         | is no substitute for knowing exactly what the assembler is
         | doing.
         | 
         | I didn't realize viewing the assembly of a compiled binary was
         | something reserved for C. Someone should tell rust.godbolt.org
         | that it's not supposed to exist.
        
           | dahfizz wrote:
           | Go read the second-place C entry[1]. It's a JIT for
           | assembly code.
           | 
           | [1] https://codegolf.stackexchange.com/questions/215216/high-
           | thr...
        
           | samhw wrote:
           | I think their point is that C is sufficiently close to
           | assembler that you know exactly what assembly any given line
           | of C will produce. This is an old canard which is not really
           | true (https://zenhack.net/2015/08/02/c-is-not-portable-
           | assembly.ht...), but it's common among people who derive a
           | sense of identity from writing C.
        
         | [deleted]
        
         | bartvk wrote:
         | It's not just knowing exactly what the assembler is doing, but
         | also what the kernel is doing (but maybe you also meant this).
         | "This implementation only supports Linux", and then continues
         | about the use of a bunch of system calls:
         | 
         | * "asks the Kernel to defragment the physical memory it's
         | using via the madvise system call"
         | 
         | * "use 2 MB of RAM each even though neither need 2 MB. This
         | keeps the page table orderly"
         | 
         | * "Most competing FizzBuzz implementations call the write
         | system call which can lead to data needing to be copied
         | between user and Kernel space. So vmsplice is used instead."
         | 
         | And then the most fascinating bit: "The CPU's L2 cache is split
         | into two halves, one for calculating the output of FizzBuzz,
         | the other storing the result of the last calculation batch.
         | This means there is no need to use any slower L3 memory
         | operations."
         | 
         | "Slower L3 memory operations", I've not read that sentence
         | before :D
        
           | MayeulC wrote:
           | > the most fascinating bit
           | 
           | It is really fascinating, but I disagree a bit with your pick
           | :)
           | 
           | To me, the most fascinating part is using a bytecode
           | interpreter. It makes perfect sense: it's basically
           | optimizing the code size to fit in the cache. It's also a
           | bit like defining a whole new instruction set dedicated to
           | and optimized for FizzBuzz!
           | 
           | I think the approach is a bit similar to a VLIW
           | https://en.wikipedia.org/wiki/Very_long_instruction_word
           | where a single 32-bit word is loaded and interpreted at a
           | time.
           | 
           | Most of the state of the Fizzbuzz program is encoded in this
           | bytecode.
           | 
           | > "Slower L3 memory operations", I've not read that sentence
           | before
           | 
           | You hear it all the time when trying to optimize for speed in
           | a hot loop. Keep your data structures in a way that fits in
           | the cache. Align your data with cache boundaries. Code needs
           | to be small too, hence the bytecode approach. This is also
           | the reason why you can't unroll loops too much.
           | 
           | If your implementation does not need to fetch (or write) data
           | from(/to) higher levels of the memory hierarchy, you'll see
           | tremendous gains. There's even this bit in the original
           | source: " _while waiting for writes to go through to the L2
           | cache_ " :)
           | 
           | There are a few more very interesting hacks, such as the
           | "high-decimal" format that skips the " _need to do binary
           | /decimal conversions._"
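As a sketch of the general idea (not the author's actual digit representation): keep the counter as decimal digits and increment it with a carry ripple, so a binary/decimal conversion never happens. The `inc_ascii` helper below is hypothetical.

```rust
// Increment a number stored as ASCII decimal digits, in place, by
// rippling the carry from the least significant digit leftward.
fn inc_ascii(digits: &mut Vec<u8>) {
    for d in digits.iter_mut().rev() {
        if *d < b'9' {
            *d += 1; // no carry: done
            return;
        }
        *d = b'0'; // 9 -> 0, keep carrying
    }
    digits.insert(0, b'1'); // all nines: "999" -> "1000"
}

fn main() {
    let mut n = b"99".to_vec();
    inc_ascii(&mut n);
    assert_eq!(n, b"100");
}
```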
        
         | jedimastert wrote:
         | > Rustaceans
         | 
         | Completely off topic, but suddenly all of the crab branding
         | makes a _ton_ more sense
        
         | bonzini wrote:
         | The second one (written by me) is actually JIT-compiled
         | assembly. So little of the runtime is spent in the C code that
         | -O0 vs -O3 makes no difference at all; and even if you wrote
         | the glue code in Python it would not be much slower than the C
         | code.
         | 
         | Also the third entry is not the one in Rust. It's just that the
         | Rust entry came later and thus was not included in the plot.
         | The third entry is
         | https://codegolf.stackexchange.com/a/215236/56694; it is also
         | C.
         | 
         | > displace the C implementation in Rust or any other language
         | besides C/C++/ASM
         | 
         | This is not a language contest, it's an algorithmic contest:
         | that is, it's the best algorithm that wins, not the best
         | compiler. The top entry, converted into C or Rust by mostly
         | menial work, would probably be only a few % slower and would
         | displace the C implementation.
        
           | sillysaurusx wrote:
           | I would _love_ to see a Python implementation that approaches
           | the C code. It would be educational on so many levels. Would
           | anyone be interested in trying?
           | 
           | Also, why is the ASM version so much faster if the glue code
           | doesn't make much difference? I suppose it matters at the
           | margins...
        
             | bonzini wrote:
             | > why is the ASM version so much faster if the glue code
             | doesn't make much difference
             | 
             | The remark on the glue code only applies to my entry (the
             | second). The assembly entry uses a completely different
             | (and much more complicated) algorithm than the C entry.
        
             | hectormalot wrote:
             | If you read the description you'll see that it's optimizing
             | around certain syscalls like memcopy (which is otherwise
             | limiting and reduces speed by 5x). I think it's less
             | language choice, and more low level optimization that is
             | making a difference here.
             | 
              | (To check, I wrote a Go program that does nothing other
              | than fmt.Print($largebody) in a for loop, and it tops out
              | at around 10 GB/s.)
        
               | tux3 wrote:
               | Note that memcpy is not a syscall. Common syscalls do
               | copies. The fast implementation uses the vmsplice syscall
               | as a clever way to pump a pipe with data, because it
               | doesn't copy.
        
               | tialaramex wrote:
               | Indeed. Today for most systems memcpy is in fact a
               | compiler intrinsic. C's standard library offers a memcpy
               | but the reality is that your C standard library was
               | likely supplied by your C compiler vendor, and when you
               | ask for memcmp() in C you get their preferred intrinsic
               | inlined into your program unless you've been very firm in
               | declining that optimisation.
               | 
                | In Rust this memcpy intrinsic is required (for the core
                | language, not just the standard library) because Rust
                | clearly has arbitrarily large data structures you can
                | literally copy, so somebody has to take responsibility
                | for doing that work; Rust says it's the compiler back-
                | end's job. On a toy CPU the memcpy might really just be
                | a trivial read->register->write loop because that's
                | adequate.
        
               | tialaramex wrote:
               | Oops, I notice hours too late to edit it, that this says
               | memcmp() in one place where it mostly says memcpy(). In
               | fact your C compiler likely has intrinsics for both, but
               | I did mean memcpy() everywhere in the comment.
        
             | Qem wrote:
             | The Python implementation posted among the answers is about
             | 85 MiB/sec, under PyPy.
        
           | animal531 wrote:
           | So what we're seeing is that a specialized handcoded
           | assembler version is 10x faster than a generic C version?
           | 
           | If so that makes me a bit sad, because in the modern day we'd
           | want to assume that compilers have gotten a little better at
           | optimization.
        
             | jiggawatts wrote:
             | Read the assembly.
             | 
             | Seriously. It's eye-popping. It's eldritch madness. It's
             | black wizardry. The blackest of the black arts.
             | 
             | The "fastest FizzBuzz" implementation isn't merely "fast".
             | It's _faster than memcpy()!!!_
             | 
             | Expecting compilers to outperform this tour-de-force is
             | asking a tad too much from compiler writers...
        
               | noduerme wrote:
               | It's actually really nice the way the author commented
               | it. It's a cool insight into their mental process. I've
               | never coded in Assembly and I can't normally follow it.
               | But reading these comments I can understand how the
               | author is surgically managing registers and operating off
               | a complete mental map of what's there, the same way a
               | coder would with, say, a very large database they have a
               | complete mental model of.
               | 
               | All I mean is that I appreciate the author making their
               | thought process accessible. It certainly looks like
               | virtuosity, but I'm not competent enough to judge.
        
               | btilly wrote:
               | From the article, _" The application produced 64 bytes of
               | FizzBuzz for every 4 CPU clock cycles. The author states
               | the ultimate bottleneck of performance is based on the
               | throughput of the CPU's L2 cache."_
               | 
               | For comparison, that time is on par with integer
               | multiplication.
        
             | bonzini wrote:
             | > assume that compilers have gotten a little better at
             | optimization
             | 
             | Again: you're comparing the performance of different
             | algorithms. A compiler won't turn bubblesort into
             | quicksort.
        
               | tialaramex wrote:
               | I mean, this is importantly true as a generalisation,
               | but it's worth knowing that modern compilers know about
               | _idioms_.
               | 
               | If you show a modern compiler the typical idiomatic
               | blob of code for counting the set bits in a word in a
               | language like C, the compiler goes "Oh, _popcount_, I
               | know this" and when targeting modern CPUs it emits a
               | single instruction, the popcount instruction for that
               | CPU.
               | 
               | In APL idiom recognition was essential to making even a
               | halfway decent compiler. It's easy, and idiomatic, to
               | write APL which would be tremendously wasteful if
               | implemented naively, while recognising common idioms
               | could remove lots of that waste without needing the level
               | of sophistication modern compilers have during
               | optimisation.
               | 
               | Because the compiler knows about idioms you should _not_
               | try to optimise your programs by writing non-idiomatic
               | code _until_ you've determined you have a performance
               | problem and identified the non-idiomatic code as a way
               | to fix it. Otherwise the compiler may improve, and now
               | your code is not only harder to maintain, it also has
               | worse performance than the idiomatic solution.
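For instance, the classic clear-the-lowest-set-bit loop is exactly such an idiom; whether a given compiler collapses it into a single popcount instruction depends on the compiler and target, but the library method (`count_ones` in Rust) computes the same thing either way:

```rust
// Idiomatic naive bit-count loop; optimizers can recognize this pattern
// and emit a hardware popcount instruction where one exists.
fn popcount(mut x: u64) -> u32 {
    let mut n = 0;
    while x != 0 {
        x &= x - 1; // clears the lowest set bit
        n += 1;
    }
    n
}

fn main() {
    assert_eq!(popcount(0b1011), 3);
    assert_eq!(popcount(0), 0);
    // Agrees with the intrinsic-backed library method:
    assert_eq!(popcount(12345), 12345u64.count_ones());
}
```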
        
               | rowanG077 wrote:
               | Not yet. But I have no doubt in my mind that once we
               | have smarter-than-human AI there will be compilers that
               | absolutely can turn bubblesort into quicksort.
        
               | marginalia_nu wrote:
               | Seems fairly unlikely. The best sorting algorithm is a
               | function of the data. The compiler can't know this. There
               | are pathological data sets where quicksort is O(n^2). For
               | small n, quick sort is also a bad choice.
               | 
               | The programmer can understand these domain
               | characteristics but the compiler certainly can't. If code
               | isn't running qsort(), this could be a mistake, but may
               | also be an optimization.
        
               | gpderetta wrote:
               | PGO is a thing for this reason.
               | 
               | (But no, I'm not expecting AI driven compilers anytime
               | soon)
        
               | rowanG077 wrote:
               | You almost always know something about the data that a
               | sorting function will operate on. Unless of course you
               | are implementing a generic sorting algorithm. In practice
               | you could train your compiler on real world data to allow
               | it to optimize better for your environment. This is
               | already happening with profile guided optimizations.
        
             | nojs wrote:
             | The assembly is ridiculous. Random section:
              |     // The bytecode generation loop is unrolled to depth 30, allowing the
              |     // units digits to be hardcoded. The tens digit is stored in %al, and
              |     // incremented every ten lines of output. The parity of the hundreds
              |     // digit is stored in %ymm2: one half predicts the hundreds digit to
              |     // be even, the other to be odd, and the halves are swapped every time
              |     // the tens digit carries (ensuring the predictions are correct).
        
               | entropicdrifter wrote:
               | Controlled CPU predictions, wow. That's a level of
               | performance I've never even contemplated
        
               | bonzini wrote:
               | It's not CPU predictions, it means that it computes in
               | parallel one result that is correct and one that is
               | wrong, and picks the right one by swapping the two halves
               | every now and then.
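A scalar sketch of that compute-both-then-select pattern (the real code works on the two halves of a %ymm register; this toy version just masks between two precomputed candidate bytes):

```rust
// Compute both candidates up front, then pick one with a mask rather
// than a branch - a toy scalar analogue of the SIMD half-swap trick.
fn pick(first_is_right: bool, first: u8, second: u8) -> u8 {
    let mask = 0u8.wrapping_sub(first_is_right as u8); // 0xFF or 0x00
    (first & mask) | (second & !mask)
}

fn main() {
    assert_eq!(pick(true, b'2', b'3'), b'2');
    assert_eq!(pick(false, b'2', b'3'), b'3');
}
```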
        
           | stavros wrote:
           | Now I'm kind of hoping that someone will rewrite the glue in
           | Python so that Python is the second fastest implementation of
           | them all, just for the laughs.
        
         | [deleted]
        
         | gopiandcode wrote:
         | You're totally right. I think a lot of rust developers are
         | going to be absolutely distraught that rust isn't the fastest
         | language for writing fizzbuzz programs.
         | 
         | I guess they're just going to have to settle with rust being
         | the best language to write safety-critical high performance
         | concurrent low-level code.
        
         | yodelshady wrote:
         | FYI, on my machine a naive Rust implementation (modulo plus
         | println!) runs at the same speed as a naive _Python_
         | implementation (16-17 MB/s), even compiled with --release.
         | Naive C manages 160-170 MB/s. (And I fully admit the naive
         | implementations are pretty much as far as I'm going with
         | this.)
         | 
         | I would guess it's the println! macro eating the time.
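For reference, "naive Rust" here means something like the sketch below (not the exact code benchmarked): one `println!` per line, where each call takes the stdout lock and the line-buffered output flushes constantly.

```rust
// Naive FizzBuzz: the formatting is trivial, but every println! locks
// stdout, and line buffering flushes once per line.
fn fizzbuzz_line(i: u32) -> String {
    match (i % 3, i % 5) {
        (0, 0) => "FizzBuzz".to_string(),
        (0, _) => "Fizz".to_string(),
        (_, 0) => "Buzz".to_string(),
        _ => i.to_string(),
    }
}

fn main() {
    for i in 1..=15 {
        println!("{}", fizzbuzz_line(i));
    }
}
```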
        
           | pcwalton wrote:
           | It's probably stdout locking. Call Stdout::lock.
        
           | yodelshady wrote:
           | edit: weirder, naive Julia (1.7 on Linux) is _terrible_:
           | 45 KB/s.
           | 
           | Ruby is slightly slower than Python or Rust, at 11 MB/s. I
           | was expecting Python to be slowest among the interpreted
           | languages; apparently not. Pypy yields a speedup to 70 MB/s.
           | 
           | Summing naive implementations (for each i: start with empty
           | string, append fizz/buzz based on the modulus of each, then
           | conditionally append i and print):
           | 
           | Julia 45 KB/s
           | Ruby 11 MB/s
           | CPython 15 MB/s
           | Rust 16 MB/s
           | Pypy 70 MB/s
           | C 160 MB/s
        
             | Qem wrote:
              | I think recent versions of Ruby include an optional JIT
             | compiler. Did you try it?
        
             | sc11 wrote:
             | Do you mind sharing the Julia code? Seems odd that it's
             | that far away from even Python.
        
               | adgjlsfhk1 wrote:
               | by default Julia uses locking for writes to guarantee
               | thread safety when multiple writes are happening. That
               | said, it's totally possible to go way faster using
               | buffered IO.
        
           | beltsazar wrote:
           | There are two reasons: 1) The println! macro will lock and
           | unlock the stdout every time it's called. 2) Rust's stdout is
           | line-buffered by default.
           | 
           | On my machine the naive C implementation achieves 81.7
           | MiB/s, whereas a naive Rust implementation with println!
           | achieves 7.58 MiB/s.
           | 
           | When I acquired the stdout lock once and used writeln!(&mut
           | stdout, ...) instead of println!, I got 8.15 MiB/s.
           | 
           | When I wrapped stdout with BufWriter, I got 299 MiB/s.
           |     use std::io::{BufWriter, Write};
           | 
           |     fn main() {
           |         let stdout = std::io::stdout();
           |         let stdout = stdout.lock();
           |         let mut stdout = BufWriter::with_capacity(32 * 1024, stdout);
           |         for i in 1..1_000_000_000 {
           |             let _ = match (i % 3 == 0, i % 5 == 0) {
           |                 (true, true) => writeln!(&mut stdout, "FizzBuzz"),
           |                 (true, false) => writeln!(&mut stdout, "Fizz"),
           |                 (false, true) => writeln!(&mut stdout, "Buzz"),
           |                 (false, false) => writeln!(&mut stdout, "{}", i),
           |             };
           |         }
           |     }
        
         | pcwalton wrote:
         | You can basically take any LLVM IR that Clang would produce and
         | produce Rust that compiles down to it (modulo irreducible
         | control flow, I think, but the C++ doesn't have that). Because
         | of that, C++ vs. Rust _language_ (as opposed to library)
         | performance comparisons are uninteresting.
         | 
         | > It's a nice reminder that for the ultimate performance, there
         | is no substitute for knowing exactly what the assembler is
         | doing.
         | 
         | C and C++ do not let you know exactly what the assembler is
         | doing. I work on compilers for a living and despite that--
         | perhaps _because_ of that--I have no confidence that I can
         | predict what instructions the compiler will generate ahead of
         | time. If I need to know the actual instructions, I run my code
         | through Godbolt and check. C, C++, and Rust are all millions of
         | lines of LLVM optimization passes away from the metal.
         | 
         | Based on some other comments, I suspect that all the time is
         | being spent in locking stdout, which you can amortize by using
         | Stdout::lock [1]. As is so often the case when microbenchmarks
         | are discussed online, this has nothing to do with the language.
         | 
         | Also, it shouldn't need to be said, but FizzBuzz is so totally
         | unrepresentative of real workloads that you shouldn't draw any
         | conclusions about language performance from it.
         | 
         | [1]: https://doc.rust-
         | lang.org/stable/std/io/struct.Stdout.html#m...
        
           | parhamn wrote:
           | > LLVM IR that Clang would produce
           | 
           | You most certainly can't, unless your code is unsafe - and
           | in that case the comparison isn't so interesting anymore.
        
             | Aissen wrote:
             | It is very interesting if you have properly verified
             | invariants and you can guarantee (with manual checking)
             | that your code is correct. It's an order of magnitude
             | harder than just letting the borrow checker do its job, but
             | it's entirely possible (just like in any other language).
        
       ___________________________________________________________________
       (page generated 2021-12-02 23:04 UTC)