[HN Gopher] 55 GiB/s FizzBuzz
___________________________________________________________________
55 GiB/s FizzBuzz
Author : luu
Score : 360 points
Date : 2021-10-28 20:29 UTC (2 hours ago)
(HTM) web link (codegolf.stackexchange.com)
(TXT) w3m dump (codegolf.stackexchange.com)
| kens wrote:
| You could probably get some blazing performance out of an FPGA. I
| made an FPGA version of FizzBuzz a few years ago, but it was
| optimized for pointless VGA animation rather than performance.
|
| https://www.righto.com/2018/04/fizzbuzz-hard-way-generating-...
| NextHendrix wrote:
| When I saw this I did wonder how much faster I could do it in
| hardware, but similarly I expect the bottleneck would be
| outputting it in a fashion that would be considered fair to be
| compared to software implementations (eg outputting readable
| text on a screen).
|
| Regardless, I very much enjoyed your DVD screensaver-esque
| output.
| fnord77 wrote:
| > // Most FizzBuzz routines produce output with `write` or a
| > // similar system call, but these have the disadvantage that
| > // they need to copy the data being output from userspace into
| > // kernelspace. It turns out that when running full speed (as
| > // seen in the third phase), FizzBuzz actually runs faster than
| > // `memcpy` does, so `write` and friends are unusable when
| > // aiming for performance - this program runs five times faster
| > // than an equivalent that uses `write`-like system calls.
|
| Why can't `write` use a reference like vmsplice?
| tehjoker wrote:
| I'm not an assembly programmer, but my intuition is that this
| is safer. You can only rely on "zero-copy" behavior when you
| have total control of the program and know that that memory
| region is going to stay put and remain uncorrupted. Therefore,
| most external code will make a copy because they can't insist
| on this (and because for most people, making a copy is pretty
| fast).
| gpderetta wrote:
| I think that some Unix derivatives do, or have done in the
| past. If the argument to write is page aligned and a multiple
| of the page size, they play VM tricks to implement zero-copy
| writes.
|
| The issue is that the caller of write is allowed to do
| anything it wants with the buffer after write returns, so the
| write implementation needs to unmap the stolen pages and
| perform copy-on-write if the caller ever touches them again,
| so in practice the optimization is very fragile. For this
| reason Linus has always refused to implement this
| optimization.
|
| vmsplice gets away with it because it is part of the caller
| contract that the caller can't touch the pages until the
| kernel is done with them. Unfortunately there is no general
| way to know when it is safe to reuse them; it is very
| application specific (for example, there might be an explicit
| ack from the consumer that it has received the data).
| nobrains wrote:
| Did anyone else get recommended and see the FizzBuzz video on
| YouTube ( https://youtu.be/QPZ0pIK_wsc ) just before this article
| or did I just witness an incredible coincidence?
| yreg wrote:
| See https://en.wikipedia.org/wiki/Frequency_illusion
| jedberg wrote:
| I wonder if it slows down with really big numbers.
| dmitrygr wrote:
| The amount of completely unnecessary effort unleashed on this
| problem in this post is amazing. I want to meet the author and
| shake his hand! It has everything from Linux kernel trivia in
| terms of output speed to intel AVX2 code. So unnecessary and so
| awesome!
|
| I've done stuff like this before, and I imagine the satisfaction
| of completing it! A tip of my hat to you, sir!
| jiggawatts wrote:
| " This is faster than a number of standard functions. In
| particular, it's faster than memcpy, which presents interesting
| challenges when it comes to I/O"
|
| Wow.
| kvathupo wrote:
| > 64 bytes of FizzBuzz per 4 clock cycles
|
| That's borderline unbelievable, given that a single AVX2
| instruction can take multiple clock cycles, and reciprocal
| throughput doesn't go below ~0.3 to my, admittedly shallow,
| knowledge. A remarkable piece of engineering!
| ot wrote:
| To me it's almost as impressive that pv does not become the
| bottleneck.
| HALtheWise wrote:
| Now I'm kinda curious to see how much faster you could go on an
| M1 Max with the GPU generating the data. Once his solution gets
| to the point of being a bytecode interpreter, it's trivially
| parallelizable, and the M1 has _fantastic_ memory bandwidth. Does
| anyone know if the implementation of pv or /dev/null actually
| requires loading the data into CPU cache?
| jason0597 wrote:
| Couldn't we load it onto an NVIDIA RTX A6000? It is much much
| faster than the M1 Max. It has a much greater memory
| bandwidth too.
| creddit wrote:
| Wouldn't benefit from UMA
| monksy wrote:
| I'm not sure why there is coding done here.
|
| A ticket for this work has been filed here:
| https://github.com/EnterpriseQualityCoding/FizzBuzzEnterpris...
|
| I'm going to have to ask that the contributors attend the
| grooming sessions and talk about the business value of this
| performance enhancement.
| einpoklum wrote:
| I like how it has a
| FizzBuzzOutputGenerationContextVisitorFactory.java
|
| file.
| [deleted]
| kruxigt wrote:
| How common is optimization this skilled in parts of what we run
| everyday? Is it somewhere in some inner loop of the browser, some
| parts of the OS? Graphics drivers? BLAS and similar? The demo
| scene?
| arberx wrote:
| How do people have so much time on their hands :(
| recursive wrote:
| Everyone has 24 hours per day.
| aloisdg wrote:
| Maybe they spend less time writing comments on HN. Don't hit
| me, I'm guilty of this too.
| abraae wrote:
| I have trouble with procrastination, but since watching Arnie's
| take on this I cannot get it out of my head and I have actually
| found it helpful (yet here I am on HN again :()
|
| https://www.youtube.com/watch?v=u_ktRTWMX3M
| jtchang wrote:
| The amount of low level CPU architecture knowledge to write such
| a program is mind boggling. Just goes to show how much room for
| improvement a lot of programs have.
| loser777 wrote:
| FizzBuzz has many properties that make it very suitable for
| these kinds of optimizations, properties that might not hold
| for general purpose code:
|
| + extremely small working set (a few registers worth of state)
| + extremely predictable branching behavior
| + no I/O
|
| These properties however don't diminish the achievement of
| leveraging AVX-2 (or any vectorization) for a problem that
| doesn't immediately jump out as SIMD.
| einpoklum wrote:
| > for a problem that doesn't immediately jump out as SIMD.
|
| It's a single transformation applied to elements of a large
| array. Parallelization is obvious, and while the exact
| SIMDization may not be obvious, one is certainly motivated to
| formulate it. And a scatter-like approach does spring to mind
| right away, I believe.
| brudgers wrote:
| FizzBuzz is a constant.
|
| It should never branch.
| [deleted]
| yumraj wrote:
| I think you're missing the point. The issue is that with the
| advent of higher level languages, starting from Java and on
| to Python and so on, most people have forgotten or have never
| learnt the skills to optimize code.
|
| I'll argue that, as a _percent_ , the number of people who
| can write proper multi-threaded code has only diminished over
| the years.
|
| And we see the result, despite the massive increase in
| computing power, software in general has become slower and
| bloated.
| foota wrote:
| I think the kind of optimization here is beyond that which
| people used to do. AVX didn't exist when it was common to
| write assembly.
| cogman10 wrote:
| > no I/O
|
| The problem description is writing out bytes, which is
| probably the most expensive part of this. In fact, if you
| read the winning solution's description, I/O is the primary
| problem here.
|
| > doesn't immediately jump out as SIMD.
|
| IDK that I agree with this assessment. Very naively, I see no
| reason you'd not take the 512-bit SIMD registers and split
| them into 16 32-bit lanes. From there, it's a relatively
| simple matter of using 2 registers for the divisors, pulling
| out the results, and transforming them into the text to
| print. In other words, you'd be chunking this up into 16
| iterations per loop (vs 1 with the naive assembly).
|
| This is the sort of thing that jumps out as easily
| vectorizable.
|
| Now, the fastest answer very obviously does not take this
| approach because I'm certain they realized the same thing,
| that the difficult part here isn't the actual division, but
| instead pumping out the correct text at the highest speed
| possible. If you read through it, most of the code is
| dedicated to converting binary numbers into ascii :D
| loser777 wrote:
| Maybe I should be more clear; no data needs to be fetched
| from disk/network, and the "writes" don't need to go past
| memory.
|
| As for the second point, you might have a different
| definition of "naive" and "relatively simple" as my brain
| has rotted too much from only thinking about SIMD for
| numerical computation. While determining divisibility is
| relatively clear, it wasn't clear how the printing would be
| easily vectorizable, as the output per number is variable in
| length.
| dvt wrote:
| > transforming them into the text to print
|
| I think this nails it. Vectorizing the math problem is
| "easy" (send batches of numbers to cores, do the division)
| but then you have to re-order it for printing (not to
| mention actually _print_ it), so paradoxically, it probably
| makes more sense for the program to be single-threaded.
| yuliyp wrote:
| You don't even need to do division. There are only 3
| patterns that repeat for every set of 10 numbers. You
| just need to track which one you're on.
| dvt wrote:
| > There are only 3 patterns that repeat for every set of
| 10 numbers
|
| Ah, yep, good point!
| tester756 wrote:
| It shows that abstractions are leaky as f :)
| munchler wrote:
| No, this simply shows that abstraction slows performance,
| which is usually a worthwhile tradeoff.
|
| Leaky abstractions are a different problem altogether.
| bigiain wrote:
| Yep. I suspect most people here could write a working
| fizzbuzz on a whiteboard in language of choice in under 5
| mins during a job interview.
|
| Sure your Python/JavaScript/Haskell/Rust version builds on
| a bunch of abstractions, but it'll run on just about
| anything, and ...
|
| "I've spent months working on this program"
|
| That's not what your boss wants to hear.
| snovv_crash wrote:
| You can pump out a Modern C++ version in 5 minutes too
| that will run loops (haha) around the higher level
| languages. The readability won't even be very
| different...
| golergka wrote:
| Also goes to show how much it would cost.
| dorianmariefr wrote:
| And maintain and evolve and debug and work on different
| machines, etc.
| Scaevolus wrote:
| This is the heart of this program. It aims to be able to
| produce a sustained output rate of 64 bytes of FizzBuzz per
| four clock cycles in its main loop (with frequent breaks to do
| I/O, and rare breaks to do more expensive calculations).
|
| The third phase operates primarily using a bytecode
| interpreter; it generates a program in "FizzBuzz bytecode",
| for which each byte of bytecode generates one byte of output.
| The bytecode language is designed so that it can be
| interpreted using SIMD instructions; 32 bytes of bytecode can
| be loaded from memory, interpreted, and have its output stored
| back into memory using just four machine instructions. This
| makes it possible to speed up the FizzBuzz calculations by
| hardcoding some of the calculations into the bytecode (this is
| similar to how JIT compilers can create a version of the
| program with some variables hardcoded, and throw it away on
| the rare occasions that those variables' values change).
|
| The bytecode format is very simple (in order to allow it to be
| interpreted in just a couple of machine instructions):
|
| - A negative byte represents a literal character (e.g. to
| produce a literal 'F', you use the bytecode -'F', i.e. -70 =
| 0xba)
| - A byte 0..7 represents the hundreds..billions digit of the
| line number respectively, and asserts that the hundreds digit
| of the line number is even
| - A byte 8..15 represents the hundreds..billions digit of the
| line number respectively, and asserts that the hundreds digit
| of the line number is odd
|
| In other words, the bytecode program only ever needs to read
| from LINENO_MID; the information stored in LINENO_LOW and
| LINENO_TOP therefore has to be hardcoded into it. The program
| therefore needs to be able to generate 600 lines of output at
| a time (600 being the smallest number divisible by 100, to be
| able to hardcode the two low digits; by 200, to get the
| assertions about the hundreds digits correct; and by 3 and 5,
| to get the Fizzes and Buzzes in the right place).
|
| The bytecode interpreter consists of four instructions:
|
| 1. Load the bytecode from memory into %ymm2;
| 2. Use it as a shuffle mask to shuffle LINENO_MID_TEMP;
| 3. Subtract the bytecode from the shuffle result;
| 4. Output the result of the subtraction.
|
| #define INTERPRET_BYTECODE(bc_offset, buf_offset) \
|     vmovdqu %ymm2, [%rdx + bc_offset]; \
|     vpshufb %ymm0, LINENO_MID_TEMP, %ymm2; \
|     vpsubb %ymm0, %ymm0, %ymm2; \
|     vmovdqa [OUTPUT_PTR + buf_offset], %ymm0
| savant_penguin wrote:
| "future-proofed where possible to be able to run faster if the
| relevant processor bottleneck - L2 cache write speed - is ever
| removed)."
|
| Being able to write a function limited mostly by L2 cache
| write speed, and to realize that, is rad.
|
| And btw this is an interesting example of how hand-optimized
| assembly can be much, much faster than any other solution. Can
| you get as fast as this solution with mostly C/C++? It uses
| interesting tricks to avoid memcpy (calling it slow, rofl).
| BeeOnRope wrote:
| You could probably get very close to this solution with C or
| C++, plus AVX intrinsics. Some might consider that "cheating"
| since intrinsics occupy kind of a grey area between a higher
| level language and asm.
| bigiain wrote:
| I suspect by the time you've disabled all the compiler
| "optimisations" that would likely knock a few orders of
| magnitude of performance off the directly translated
| algorithm, you may as well have written the assembler to
| start with...
|
| And you probably can't fine tune your C/C++ to get this
| performance without knowing exactly what processor
| instructions you are trying to trick the compiler into
| generating anyway.
| kccqzy wrote:
| > without knowing exactly what processor instructions you
| are trying to trick the compiler into generating
|
| In fact, in this case you know exactly what processor
| instructions the compiler is going to generate. You are
| using AVX intrinsics after all.
|
| And no, you don't need to disable compiler optimizations;
| they work well with these intrinsics.
| m0ck wrote:
| Thanks for my daily dose of software engineer imposter syndrome.
| btheshoe wrote:
| It never goes away, does it?
| andkon wrote:
| I was very grateful to see this disclaimer:
|
| > I've spent months working on this program
| no_time wrote:
| next up: custom FizzBuzz ASIC with 128 fizz and 128 buzz cores
| for maximum throughput.
| mordechai9000 wrote:
| I think optimizing binary tree inversion is a higher priority,
| right now.
| WJW wrote:
| Not for fizzbuzz optimisation it isn't! Get your priorities
| straight mate! Who is trying to get hired when we could be
| outputting more fizzes and/or buzzes! :)
| function_seven wrote:
| Substandard design. You should have a suitable ratio of cores
| to keep them load balanced. I suggest 160 fizz cores and 96
| buzz cores.
|
| EDIT: And a fizzbuzz co-processor, obviously.
| gumby wrote:
| > fizzbuzz co-processor, obviously
|
| One of the functional modules on the M1 Pro Max chip, but
| only the Max one, unfortunately.
|
| Take that, Intel!
| randy408 wrote:
| FizzBuzzCoin?
| wtallis wrote:
| Or if your ASIC generalizes the problem in a slightly different
| direction, you end up reinventing TWINKLE and TWIRL:
| https://en.wikipedia.org/wiki/TWINKLE
| nonameiguess wrote:
| Could probably store all multiples of 3 and 5 up to some really
| big number burned directly to hardware and then do something
| like a big CAM table the way Internet core routers do mapping
| the numbers to ASCII bit strings. Then speedup IO by not having
| a general purpose display, but something more like an old-
| school digital clock that can only show digits and the words
| "fizz" and "buzz."
| function_seven wrote:
| You'll get docked points from the typography nerd for
| displaying "fizz" instead of "ﬁzz"
| jerf wrote:
| "The Grid. A digital frontier. I tried to picture clusters of
| information as they moved through the computer. What did they
| look like? Ships, motorcycles? Were the circuits like freeways? I
| kept dreaming of a world I thought I'd never see. And then, one
| day I got in...
|
| It turns out The Grid is just a guy sitting in a chair, shouting
| about "Fizz!" and "Buzz!" as fast as he can.
|
| It wasn't really what I had in mind.
|
| (The image of this poor program, stuck shouting "fizz!" and
| "buzz!" for subjectively centuries at a time struck me...)
| jedberg wrote:
| I kinda hope as a joke in Tron 3 they have a guy muttering in
| the corner "1, 2, Fizz, 4, Buzz, Fizz, 7, 8, Fizz" as someone
| walks by.
| tigershark wrote:
| Still better than throwing away a significant percentage of the
| world total power output just to compute some useless hashes if
| you ask me..
___________________________________________________________________
(page generated 2021-10-28 23:00 UTC)