[HN Gopher] 55 GiB/s FizzBuzz
       ___________________________________________________________________
        
       55 GiB/s FizzBuzz
        
       Author : luu
       Score  : 360 points
       Date   : 2021-10-28 20:29 UTC (2 hours ago)
        
 (HTM) web link (codegolf.stackexchange.com)
 (TXT) w3m dump (codegolf.stackexchange.com)
        
       | kens wrote:
       | You could probably get some blazing performance out of an FPGA. I
       | made an FPGA version of FizzBuzz a few years ago, but it was
       | optimized for pointless VGA animation rather than performance.
       | 
       | https://www.righto.com/2018/04/fizzbuzz-hard-way-generating-...
        
         | NextHendrix wrote:
         | When I saw this I did wonder how much faster I could do it in
         | hardware, but similarly I expect the bottleneck would be
         | outputting it in a fashion that would be considered fair to be
         | compared to software implementations (eg outputting readable
         | text on a screen).
         | 
         | Regardless, I very much enjoyed your DVD screensaver-esque
         | output.
        
       | fnord77 wrote:
        | > // Most FizzBuzz routines produce output with `write` or a
        | > // similar system call, but these have the disadvantage
        | > // that they need to copy the data being output from
        | > // userspace into kernelspace. It turns out that when
        | > // running full speed (as seen in the third phase),
        | > // FizzBuzz actually runs faster than `memcpy` does, so
        | > // `write` and friends are unusable when aiming for
        | > // performance - this program runs five times faster than
        | > // an equivalent that uses `write`-like system calls.
       | 
       | Why can't `write` use a reference like vmsplice?
        
         | tehjoker wrote:
         | I'm not an assembly programmer, but my intuition is that this
         | is safer. You can only rely on "zero-copy" behavior when you
         | have total control of the program and know that that memory
         | region is going to stay put and remain uncorrupted. Therefore,
         | most external code will make a copy because they can't insist
         | on this (and because for most people, making a copy is pretty
         | fast).
        
           | gpderetta wrote:
            | I think some Unix derivatives do, or have done in the
            | past: if the argument to write is page-aligned and a
            | multiple of the page size, they play VM tricks to
            | implement zero-copy writes.
            | 
            | The issue is that the caller of write is allowed to do
            | anything it wants with the buffer after write returns, so
            | the write implementation needs to unmap the stolen pages
            | and perform copy-on-write if the caller ever touches them
            | again; in practice the optimization is very fragile. For
            | this reason Linus has always refused to implement it.
           | 
            | vmsplice gets away with it because it is part of the
            | caller contract that it can't ever touch the pages until
            | the kernel is done with them. Unfortunately there is no
            | general way to know when it is safe to reuse them, and it
            | is very application-specific (for example, there might be
            | an explicit ack from the consumer that it has received
            | the data).
        
       | nobrains wrote:
       | Did anyone else get recommended and see the FizzBuzz video on
       | YouTube ( https://youtu.be/QPZ0pIK_wsc ) just before this article
       | or did I just witness an incredible coincidence?
        
         | yreg wrote:
         | See https://en.wikipedia.org/wiki/Frequency_illusion
        
       | jedberg wrote:
       | I wonder if it slows down with really big numbers.
        
       | dmitrygr wrote:
       | The amount of completely unnecessary effort unleashed on this
       | problem in this post is amazing. I want to meet the author and
        | shake his hand! It has everything from Linux kernel trivia
        | in terms of output speed to Intel AVX2 code. So unnecessary
        | and so
       | awesome!
       | 
       | I've done stuff like this before, and I imagine the satisfaction
       | of completing it! A tip of my hat to you, sir!
        
       | jiggawatts wrote:
       | " This is faster than a number of standard functions. In
       | particular, it's faster than memcpy, which presents interesting
       | challenges when it comes to I/O"
       | 
       | Wow.
        
       | kvathupo wrote:
       | > 64 bytes of FizzBuzz per 4 clock cycles
       | 
        | That's borderline incredible, given that a single AVX2
        | instruction can last multiple clock cycles, and reciprocal
        | throughput doesn't go below ~0.3 to my, admittedly shallow,
        | knowledge. A remarkable piece of engineering!
        
       | ot wrote:
       | To me it's almost as impressive that pv does not become the
       | bottleneck.
        
       | HALtheWise wrote:
       | Now I'm kinda curious to see how much faster you could go on an
       | M1 Max with the GPU generating the data. Once his solution gets
       | to the point of being a bytecode interpreter, it's trivially
        | parallelizable and the M1 has _fantastic_ memory bandwidth. Does
       | anyone know if the implementation of pv or /dev/null actually
       | requires loading the data into CPU cache?
        
         | jason0597 wrote:
          | Couldn't we load it onto an NVIDIA RTX A6000? It is much,
          | much faster than the M1 Max, and it has much greater
          | memory bandwidth too.
        
           | creddit wrote:
           | Wouldn't benefit from UMA
        
       | monksy wrote:
       | I'm not sure why there is coding done here.
       | 
       | A ticket for this work has been made at here:
       | https://github.com/EnterpriseQualityCoding/FizzBuzzEnterpris...
       | 
       | I'm going to have to ask that the contributors attend the
       | grooming sessions and talk about the business value of this
       | performance enhancement.
        
         | einpoklum wrote:
         | I like how it has a
         | FizzBuzzOutputGenerationContextVisitorFactory.java
         | 
         | file.
        
         | [deleted]
        
       | kruxigt wrote:
        | How common is optimization this skilled in the software we
        | run every day? Is it somewhere in some inner loop of the
        | browser, some parts of the OS? Graphics drivers? BLAS and
        | similar? The demo scene?
        
       | arberx wrote:
       | How do people have so much time on their hands :(
        
         | recursive wrote:
         | Everyone has 24 hours per day.
        
         | aloisdg wrote:
          | Maybe they spend less time writing comments on HN. Don't
          | hit me, I am guilty of this too.
        
         | abraae wrote:
         | I have trouble with procrastination, but since watching Arnie's
         | take on this I cannot get it out of my head and I have actually
         | found it helpful (yet here I am on HN again :()
         | 
         | https://www.youtube.com/watch?v=u_ktRTWMX3M
        
       | jtchang wrote:
       | The amount of low level CPU architecture knowledge to write such
       | a program is mind boggling. Just goes to show how much room for
       | improvement a lot of programs have.
        
         | loser777 wrote:
          | FizzBuzz has many properties that make it very suitable
          | for these kinds of optimizations that might not be
          | applicable to general-purpose code:
          | 
          | + extremely small working set (a few registers' worth of
          |   state)
          | + extremely predictable branching behavior
          | + no I/O
         | 
         | These properties however don't diminish the achievement of
         | leveraging AVX-2 (or any vectorization) for a problem that
         | doesn't immediately jump out as SIMD.
        
           | einpoklum wrote:
           | > for a problem that doesn't immediately jump out as SIMD.
           | 
           | It's a single transformation applied to elements of a large
           | array. Parallelization is obvious, and maybe the exact
           | SIMDization is not obvious, one is certainly motivated to
           | formulate it. And a scatter-like approach does spring to mind
           | right away, I believe.
        
           | brudgers wrote:
           | FizzBuzz is a constant.
           | 
           | It should never branch.
        
           | [deleted]
        
           | yumraj wrote:
           | I think you're missing the point. The issue is that with the
           | advent of higher level languages, starting from Java and on
           | to Python and so on, most people have forgotten or have never
           | learnt the skills to optimize code.
           | 
            | I'll argue that, as a _percentage_, the number of people
            | who can write proper multi-threaded code has only
            | diminished over the years.
           | 
           | And we see the result, despite the massive increase in
           | computing power, software in general has become slower and
           | bloated.
        
             | foota wrote:
             | I think the kind of optimization here is beyond that which
             | people used to do. AVX didn't exist when it was common to
             | write assembly.
        
           | cogman10 wrote:
           | > no I/O
           | 
           | The problem description is writing out bytes which is
           | probably some of the more expensive part of this. In fact, if
           | you read the winning solution description, IO is the primary
           | problem here.
           | 
           | > doesn't immediately jump out as SIMD.
           | 
            | IDK that I agree with this assessment. Very naively, I
            | see no reason you'd not take the 512-bit SIMD registers
            | and split them into 16 32-bit lanes. From there, it's a
            | relatively simple matter of using 2 registers for the
            | divisors, pulling out the results, and transforming them
            | into the text to print. In other words, you'd be
            | chunking this up into 16 iterations per loop (vs 1 with
            | the naive assembly).
            | 
            | This is the sort of thing that jumps out as easily
            | vectorizable.
           | 
           | Now, the fastest answer very obviously does not take this
           | approach because I'm certain they realized the same thing,
           | that the difficult part here isn't the actual division, but
           | instead pumping out the correct text at the highest speed
           | possible. If you read through it, most of the code is
           | dedicated to converting binary numbers into ascii :D
        
             | loser777 wrote:
             | Maybe I should be more clear; no data needs to be fetched
             | from disk/network, and the "writes" don't need to go past
             | memory.
             | 
              | As for the second point, you might have a different
              | definition of "naive" and "relatively simple", as my
              | brain has rotted too much from only thinking about
              | SIMD for numerical computation. While determining
              | divisibility may be relatively clear, it wasn't clear
              | to me how the printing would be easily vectorizable,
              | as the output-per-number is variable in length.
        
             | dvt wrote:
             | > transforming them into the text to print
             | 
             | I think this nails it. Vectorizing the math problem is
             | "easy" (send batches of numbers to cores, do the division)
             | but then you have to re-order it for printing (not to
             | mention actually _print_ it), so paradoxically, it probably
             | makes more sense for the program to be single-threaded.
        
               | yuliyp wrote:
               | You don't even need to do division. There are only 3
               | patterns that repeat for every set of 10 numbers. You
               | just need to track which one you're on.
        
               | dvt wrote:
               | > There are only 3 patterns that repeat for every set of
               | 10 numbers
               | 
               | Ah, yep, good point!
        
         | tester756 wrote:
         | It shows that abstractions are leaky as f :)
        
           | munchler wrote:
           | No, this simply shows that abstraction slows performance,
           | which is usually a worthwhile tradeoff.
           | 
           | Leaky abstractions are a different problem altogether.
        
             | bigiain wrote:
             | Yep. I suspect most people here could write a working
             | fizzbuzz on a whiteboard in language of choice in under 5
             | mins during a job interview.
             | 
             | Sure your Python/JavaScript/Haskell/Rust version builds on
             | a bunch of abstractions, but it'll run on just about
             | anything, and ...
             | 
             | "I've spent months working on this program"
             | 
             | That's not what your boss wants to hear.
        
               | snovv_crash wrote:
               | You can pump out a Modern C++ version in 5 minutes too
               | that will run loops (haha) around the higher level
               | languages. The readability won't even be very
               | different...
        
         | golergka wrote:
            | Also goes to show how much it would cost.
        
           | dorianmariefr wrote:
           | And maintain and evolve and debug and work on different
           | machines, etc.
        
       | Scaevolus wrote:
        | This is the heart of this program. It aims to be able to
        | produce a sustained output rate of 64 bytes of FizzBuzz per
        | four clock cycles in its main loop (with frequent breaks to
        | do I/O, and rare breaks to do more expensive calculations).
        | 
        | The third phase operates primarily using a bytecode
        | interpreter; it generates a program in "FizzBuzz bytecode",
        | for which each byte of bytecode generates one byte of
        | output. The bytecode language is designed so that it can be
        | interpreted using SIMD instructions; 32 bytes of bytecode
        | can be loaded from memory, interpreted, and have its output
        | stored back into memory using just four machine
        | instructions. This makes it possible to speed up the
        | FizzBuzz calculations by hardcoding some of the calculations
        | into the bytecode (this is similar to how JIT compilers can
        | create a version of the program with some variables
        | hardcoded, and throw it away on the rare occasions that
        | those variables' values change).
        | 
        | The bytecode format is very simple (in order to allow it to
        | be interpreted in just a couple of machine instructions):
        | 
        | - A negative byte represents a literal character (e.g. to
        |   produce a literal 'F', you use the bytecode -'F', i.e.
        |   -70 = 0xba)
        | - A byte 0..7 represents the hundreds..billions digit of the
        |   line number respectively, and asserts that the hundreds
        |   digit of the line number is even
        | - A byte 8..15 represents the hundreds..billions digit of
        |   the line number respectively, and asserts that the
        |   hundreds digit of the line number is odd
        | 
        | In other words, the bytecode program only ever needs to read
        | from LINENO_MID; the information stored in LINENO_LOW and
        | LINENO_TOP therefore has to be hardcoded into it. The
        | program therefore needs to be able to generate 600 lines of
        | output (as the smallest number that's divisible by 100 to be
        | able to hardcode the two low digits, 200 to be able to get
        | the assertions about the hundreds digits correct, and 3 and
        | 5 to get the Fizzes and Buzzes in the right place).
        | 
        | The bytecode interpreter consists of four instructions:
        | 
        | 1. Load the bytecode from memory into %ymm2;
        | 2. Use it as a shuffle mask to shuffle LINENO_MID_TEMP;
        | 3. Subtract the bytecode from the shuffle result;
        | 4. Output the result of the subtraction.
        | 
        |   #define INTERPRET_BYTECODE(bc_offset, buf_offset) \
        |     vmovdqu %ymm2, [%rdx + bc_offset]; \
        |     vpshufb %ymm0, LINENO_MID_TEMP, %ymm2; \
        |     vpsubb %ymm0, %ymm0, %ymm2; \
        |     vmovdqa [OUTPUT_PTR + buf_offset], %ymm0
        
       | savant_penguin wrote:
       | "future-proofed where possible to be able to run faster if the
       | relevant processor bottleneck - L2 cache write speed - is ever
       | removed)."
       | 
        | Being able to write a function limited mostly by L2 cache
        | write speed, and to realize that, is rad.
       | 
        | And btw this is an interesting example of how hand-optimized
        | assembly can be much, much faster than any other solution.
        | Can you get as fast as this solution with mostly C/C++? It
        | uses interesting tricks to avoid memcpy (calling it slow
        | rofl)
        
         | BeeOnRope wrote:
          | You could probably get very close to this solution with C
          | or C++, plus AVX intrinsics. Some might consider that
          | "cheating", since intrinsics occupy kind of a grey area
          | between a higher-level language and asm.
        
           | bigiain wrote:
            | I suspect by the time you've disabled all the compiler
            | "optimisations" that would knock a few orders of
            | magnitude of performance off the directly translated
            | algorithm, you may as well have written the assembler to
            | start with...
           | 
           | And you probably can't fine tune your C/C++ to get this
           | performance without knowing exactly what processor
           | instructions you are trying to trick the compiler into
           | generating anyway.
        
             | kccqzy wrote:
             | > without knowing exactly what processor instructions you
             | are trying to trick the compiler into generating
             | 
             | In fact, in this case you know exactly what processor
             | instructions the compiler is going to generate. You are
             | using AVX intrinsics after all.
             | 
             | And no, compiler optimizations work well with these
             | intrinsics.
        
       | m0ck wrote:
       | Thanks for my daily dose of software engineer imposter syndrome.
        
         | btheshoe wrote:
         | It never goes away, does it
        
         | andkon wrote:
         | I was very grateful to see this disclaimer:
         | 
         | > I've spent months working on this program
        
       | no_time wrote:
       | next up: custom FizzBuzz ASIC with 128 fizz and 128 buzz cores
       | for maximum throughput.
        
         | mordechai9000 wrote:
         | I think optimizing binary tree inversion is a higher priority,
         | right now.
        
           | WJW wrote:
           | Not for fizzbuzz optimisation it isn't! Get your priorities
           | straight mate! Who is trying to get hired when we could be
           | outputting more fizzes and/or buzzes! :)
        
         | function_seven wrote:
         | Substandard design. You should have a suitable ratio of cores
         | to keep them load balanced. I suggest 160 fizz cores and 96
         | buzz cores.
         | 
         | EDIT: And a fizzbuzz co-processor, obviously.
        
           | gumby wrote:
           | > fizzbuzz co-processor, obviously
           | 
           | One of the functional modules on the M1 Pro Max chip, but
           | only the Max one, unfortunately.
           | 
           | Take that, Intel!
        
         | randy408 wrote:
         | FizzBuzzCoin?
        
         | wtallis wrote:
         | Or if your ASIC generalizes the problem in a slightly different
         | direction, you end up reinventing TWINKLE and TWIRL:
         | https://en.wikipedia.org/wiki/TWINKLE
        
         | nonameiguess wrote:
         | Could probably store all multiples of 3 and 5 up to some really
         | big number burned directly to hardware and then do something
         | like a big CAM table the way Internet core routers do mapping
         | the numbers to ASCII bit strings. Then speedup IO by not having
         | a general purpose display, but something more like an old-
         | school digital clock that can only show digits and the words
         | "fizz" and "buzz."
        
           | function_seven wrote:
            | You'll get docked points from the typography nerd for
            | displaying "fizz" instead of "ﬁzz"
        
       | jerf wrote:
       | "The Grid. A digital frontier. I tried to picture clusters of
       | information as they moved through the computer. What did they
       | look like? Ships, motorcycles? Were the circuits like freeways? I
        | kept dreaming of a world I thought I'd never see. And then,
        | one day, I got in..."
       | 
       | It turns out The Grid is just a guy sitting in a chair, shouting
       | about "Fizz!" and "Buzz!" as fast as he can.
       | 
       | It wasn't really what I had in mind.
       | 
       | (The image of this poor program, stuck shouting "fizz!" and
       | "buzz!" for subjectively centuries at a time struck me...)
        
         | jedberg wrote:
         | I kinda hope as a joke in Tron 3 they have a guy muttering in
         | the corner "1, 2, Fizz, 4, Buzz, Fizz, 7, 8, Fizz" as someone
         | walks by.
        
         | tigershark wrote:
         | Still better than throwing away a significant percentage of the
         | world total power output just to compute some useless hashes if
         | you ask me..
        
       ___________________________________________________________________
       (page generated 2021-10-28 23:00 UTC)