hngopher.com

       [HN Gopher] SIMD Perlin Noise: Beating the Compiler with SSE (2014)
       ___________________________________________________________________
        
       SIMD Perlin Noise: Beating the Compiler with SSE (2014)
        
       Author : homarp
       Score  : 43 points
       Date   : 2025-07-21 05:28 UTC (2 days ago)
        
 (HTM) web link (scallywag.software)
 (TXT) w3m dump (scallywag.software)
        
       | jesse__ wrote:
       | Author here, AMA :)
        
         | 0points wrote:
         | Nice write-up, and congratulations on the result! Since it's
         | about perlin and performance, have you had a look at
         | opensimplex?
         | 
         | PS. bonsai looks really cool! Checking it out right now
        
           | jesse__ wrote:
           | I haven't looked at opensimplex. I will when I get around to
           | doing a simplex implementation.
           | 
           | And thanks for the kind words!
        
         | Keyframe wrote:
         | pretty sweet! I'm mostly interested in how / what did you do to
         | measure the performance and focus on a function. Is it perf
         | pretty much with hist or visualizer or what?
        
           | jesse__ wrote:
           | I just called _rdtsc() before and after the noise gen once
           | every iteration, and pushed the sample onto a fixed size
           | buffer .. after some N iterations (4k maybe, can't remember)
           | of samples, computed min/max/avg.
           | 
           | There's a little project here that I used to benchmark in
           | part 4
           | 
           | https://github.com/scallyw4g/bonsai_noise_bench
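
The timing approach described above can be sketched roughly as follows. This is a minimal illustration, not the code from the linked repository: the names `read_ticks`, `sample_buffer`, `push_sample`, `summarize`, and `bench` are hypothetical, the 4096-sample capacity follows the "4k maybe" recollection in the comment, and the non-x86 fallback clock is an added assumption so the sketch builds elsewhere.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime in the fallback path */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>  /* __rdtsc() on GCC and Clang */
static inline uint64_t read_ticks(void) { return __rdtsc(); }
#else
#include <time.h>
/* Assumption: off x86, fall back to a monotonic nanosecond clock. */
static inline uint64_t read_ticks(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}
#endif

#define SAMPLE_CAPACITY 4096  /* hypothetical fixed buffer size */

typedef struct {
    uint64_t samples[SAMPLE_CAPACITY];
    size_t count;
} sample_buffer;

typedef struct {
    uint64_t min, max;
    double avg;
} sample_stats;

/* Append one sample; returns 0 once the fixed-size buffer is full. */
int push_sample(sample_buffer *buf, uint64_t ticks) {
    if (buf->count == SAMPLE_CAPACITY) return 0;
    buf->samples[buf->count++] = ticks;
    return 1;
}

/* Compute min/max/avg over everything collected so far. */
sample_stats summarize(const sample_buffer *buf) {
    assert(buf->count > 0);
    sample_stats s = { buf->samples[0], buf->samples[0], 0.0 };
    double sum = 0.0;
    for (size_t i = 0; i < buf->count; ++i) {
        uint64_t v = buf->samples[i];
        if (v < s.min) s.min = v;
        if (v > s.max) s.max = v;
        sum += (double)v;
    }
    s.avg = sum / (double)buf->count;
    return s;
}

/* Time `work` once per iteration, then summarize the samples. */
sample_stats bench(void (*work)(void), size_t iterations) {
    static sample_buffer buf;  /* static: keeps 32 KB off the stack */
    buf.count = 0;
    for (size_t i = 0; i < iterations; ++i) {
        uint64_t t0 = read_ticks();
        work();
        uint64_t t1 = read_ticks();
        if (!push_sample(&buf, t1 - t0)) break;
    }
    return summarize(&buf);
}
```

Calling `bench(generate_noise, 4096)` with some no-argument `generate_noise` function of your own would return the three statistics. The minimum is usually the most stable of the three, since interrupts and cache misses only ever inflate a sample.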
        
         | jokoon wrote:
         | you should post the result at the end
         | 
         | and yes, make a benchmark
         | 
         | (although I would not know how to make one, or what reference
         | point to use)
         | 
         | what do you think about the fastnoiselite implementation used
         | in godot?
        
           | jesse__ wrote:
           | I kinda did post results at the end of part 4 .. I beat the
           | SOTA by 1.8x
           | 
           | There's a benchmark utility here:
           | https://github.com/scallyw4g/bonsai_noise_bench
           | 
           | Fastnoise2 is a high quality library. Can't speak to
           | fastnoiselite .. never looked at it.
        
         | vlovich123 wrote:
         | Which compiler & optimization settings did you use? Out of
         | curiosity, any idea why the compiler failed to auto-vectorize
         | the loops?
        
           | jesse__ wrote:
           | Clang -O2
           | 
           | ..
           | 
           | -O3 didn't seem to make any appreciable difference.
           | 
           | Re. the auto-vectorization, I really don't know. I didn't
           | even read the assembly the compiler generated until at least
           | halfway through the process. Generally I've found that you
           | basically can't rely on the compiler auto vectorizing
           | anything, ever, if it actually matters.
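
To make the scalar-versus-intrinsics distinction concrete, here is a small sketch of one piece of Perlin noise, the quintic fade curve 6t^5 - 15t^4 + 10t^3, written both as a plain loop that the compiler may or may not vectorize and as explicit SSE. It is not taken from the article; the function names are hypothetical, and the scalar fallback on non-SSE targets is an assumption added for portability.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__) || defined(__x86_64__)
#include <xmmintrin.h>
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

/* Scalar reference: t^3 * (t * (6t - 15) + 10). */
float fade_scalar(float t) {
    return t * t * t * (t * (t * 6.0f - 15.0f) + 10.0f);
}

/* Plain loop; whether this gets vectorized is up to the compiler. */
void fade_array_scalar(const float *in, float *out, size_t n) {
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i) out[i] = fade_scalar(in[i]);
}

/* Explicit SSE: four lanes per iteration, scalar loop for the tail. */
void fade_array_sse(const float *in, float *out, size_t n) {
    assert(in != NULL && out != NULL);
    size_t i = 0;
#if HAVE_SSE
    const __m128 c6  = _mm_set1_ps(6.0f);
    const __m128 c15 = _mm_set1_ps(15.0f);
    const __m128 c10 = _mm_set1_ps(10.0f);
    for (; i + 4 <= n; i += 4) {
        __m128 t = _mm_loadu_ps(in + i);  /* unaligned load of 4 floats */
        __m128 inner = _mm_add_ps(
            _mm_mul_ps(t, _mm_sub_ps(_mm_mul_ps(t, c6), c15)), c10);
        __m128 t3 = _mm_mul_ps(_mm_mul_ps(t, t), t);
        _mm_storeu_ps(out + i, _mm_mul_ps(t3, inner));
    }
#endif
    for (; i < n; ++i) out[i] = fade_scalar(in[i]);  /* leftover elements */
}
```

Comparing the disassembly of `fade_array_scalar` at -O2 against the intrinsics version is one way to see whether the compiler vectorized the loop on a given target.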
        
       | addaon wrote:
       | Memories. As a personal project back in... 2003?... I decided to
       | do something similar, implement 4D Perlin Noise in Altivec
       | assembly. The only problem was that I had a G3 iBook; so I would
       | write one instruction of assembly, then write a C function to
       | interpret that assembly, building an interpreter for a very
       | selective subset of PPC w/ Altivec that ran (slooooowly) on the
       | G3. As I recall I got it down to ~200 instructions, and it worked
       | perfectly the first time I ran it on a G4, which was pretty
       | rewarding. Took me more than half a day, though. On an unrelated
       | note, I got an internship with Apple's performance team that
       | summer.
        
       | rincebrain wrote:
       | Did you profile the results with different compilers?
       | 
       | The last time I tried doing this kind of microoptimization for
       | fun, I ended up bundling actual assembly files, because the
       | generated assembly for intrinsics was so variable in performance
       | across compilers it was the only way to get consistent results on
       | many platforms.
        
         | jesse__ wrote:
         | I only build the project this is embedded in with clang, so
         | that's the only compiler I tested.
        
       | llm_nerd wrote:
       | HN loves SIMD, and there is a "how I hand crafted a SIMD
       | optimization" post doing numbers on here regularly. They're fun
       | posts, and it absolutely speaks to the fact that writing code
       | that optimizing compilers can robustly and comprehensively turn
       | into good SIMD branches is somewhat of a black art.
       | 
       | Which is why you, _generally_, shouldn't be doing either. You
       | shouldn't rely upon the compiler to figure out your intentions,
       | and you shouldn't be writing SIMD instructions directly unless
       | you're writing a SIMD library or an optimizing compiler.
       | 
       | Instead you should reach for one of the many available libraries
       | that not only force you into appropriately structuring your data
       | and calls for SIMD goodness, they're massively more portable and
       | powerful.
       | 
       | Google's Highway, for instance, will let you use their abstracted
       | SIMD functions and it provides the optimization whether your
       | target is SSE2-4, AVX, AVX2, AVX512, AVX10, or if you build for
       | ARM NEON or SVE, for any conceivable vector size, or WASM's weird
       | SIMD functions, or RISC-V's RVV, and several more, and when new
       | widths and new options come out, the library adds the support and
       | you might not have to change your code at all.
       | 
       | There are loads of libraries like this (xsimd, EVE, SIMDe, etc).
       | They all force you into thinking about structuring your code in a
       | manner that is SIMDable -- instead of hoping the optimizing
       | compiler will figure it out on its own -- and provide targeting
       | for a vast trove of SIMD options without hand-writing for every
       | option.
       | 
       | I was going to quickly rewrite the example in Highway just to
       | demonstrate but the Perlin stuff seems to be missing or
       | significantly restructured.
       | 
       | "_But that is obvious and I'm mad that you commented this_" -
       | no, it isn't obvious whatsoever, and this "I hand-rolled some SSE
       | now my app is super awesome look at the microbenchmark results on
       | a very narrow, specific machine" content appears on here
       | regularly, betraying a pretty big influence of beginners who
       | _don't_ know that it's almost certainly the wrong approach.
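
The general idea of writing against an abstraction instead of raw intrinsics can be illustrated with a toy layer like the one below. This is emphatically not Highway's API or that of any library named in the comment: `v4f` and the `v4f_*` helpers are hypothetical names invented for this sketch, fixed at four lanes, with SSE, NEON, and plain-C backends chosen at compile time. Real libraries cover far more targets and vector widths.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__) || defined(__x86_64__)
#include <xmmintrin.h>
typedef __m128 v4f;
static inline v4f v4f_load(const float *p) { return _mm_loadu_ps(p); }
static inline void v4f_store(float *p, v4f v) { _mm_storeu_ps(p, v); }
static inline v4f v4f_add(v4f a, v4f b) { return _mm_add_ps(a, b); }
static inline v4f v4f_sub(v4f a, v4f b) { return _mm_sub_ps(a, b); }
static inline v4f v4f_mul(v4f a, v4f b) { return _mm_mul_ps(a, b); }
#elif defined(__ARM_NEON)
#include <arm_neon.h>
typedef float32x4_t v4f;
static inline v4f v4f_load(const float *p) { return vld1q_f32(p); }
static inline void v4f_store(float *p, v4f v) { vst1q_f32(p, v); }
static inline v4f v4f_add(v4f a, v4f b) { return vaddq_f32(a, b); }
static inline v4f v4f_sub(v4f a, v4f b) { return vsubq_f32(a, b); }
static inline v4f v4f_mul(v4f a, v4f b) { return vmulq_f32(a, b); }
#else
/* Plain-C backend: same interface, one lane at a time. */
typedef struct { float lane[4]; } v4f;
static inline v4f v4f_load(const float *p) {
    v4f v; for (int i = 0; i < 4; ++i) v.lane[i] = p[i]; return v;
}
static inline void v4f_store(float *p, v4f v) {
    for (int i = 0; i < 4; ++i) p[i] = v.lane[i];
}
static inline v4f v4f_add(v4f a, v4f b) {
    for (int i = 0; i < 4; ++i) a.lane[i] += b.lane[i]; return a;
}
static inline v4f v4f_sub(v4f a, v4f b) {
    for (int i = 0; i < 4; ++i) a.lane[i] -= b.lane[i]; return a;
}
static inline v4f v4f_mul(v4f a, v4f b) {
    for (int i = 0; i < 4; ++i) a.lane[i] *= b.lane[i]; return a;
}
#endif

/* Written once against the wrapper: out = a + t * (b - a).
   Inputs are separate contiguous arrays, the layout SIMD code wants. */
void lerp_array(const float *a, const float *b, const float *t,
                float *out, size_t n) {
    assert(n % 4 == 0);  /* sketch only: no tail handling */
    for (size_t i = 0; i < n; i += 4) {
        v4f va = v4f_load(a + i);
        v4f vb = v4f_load(b + i);
        v4f vt = v4f_load(t + i);
        v4f_store(out + i, v4f_add(va, v4f_mul(vt, v4f_sub(vb, va))));
    }
}
```

The calling code in `lerp_array` never names an instruction set, which is the portability argument in miniature. Whether such a layer matches hand-tuned intrinsics for a specific kernel is exactly what the replies below dispute.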
        
         | 63 wrote:
         | This is a valuable viewpoint that lines up somewhat with some
         | other discussion I've seen on the topic [0]. I'd like to see
         | more posts about structuring code for the auto vectorizer (with
         | libraries or otherwise) rather than writing simd by hand. Do
         | you have any documentation you'd recommend?
         | 
         | [0] https://matklad.github.io/2023/04/09/can-you-trust-a-
         | compile...
        
         | jesse__ wrote:
         | I disagree pretty strongly with most of what you said, but I'd
         | be very interested in seeing a Highway example and looking at
         | the differences. Take a look through the comments, I left a
         | link to the test bench I made, which contains all the code.
        
       ___________________________________________________________________
       (page generated 2025-07-23 23:01 UTC)