[HN Gopher] One Billion Nested Loop Iterations
       ___________________________________________________________________
        
       One Billion Nested Loop Iterations
        
       Author : behnamoh
       Score  : 51 points
       Date   : 2024-11-26 18:17 UTC (4 hours ago)
        
 (HTM) web link (benjdd.com)
 (TXT) w3m dump (benjdd.com)
        
       | davikr wrote:
       | This does not seem like a realistic benchmark target.
        
         | mrkeen wrote:
         | Perhaps it's unrealistic in the sense that this is
         | unrealistically-optimisable code.
         | 
          | Real-world code has closures, allocations, and pointer
          | dereferencing. This code is just iterations, additions, and
          | modulus.
        
           | behnamoh wrote:
           | > This code is just iterations...
           | 
           | Which is what the average programmer out there writes.
           | Therefore, I think this is actually a good realistic
           | benchmark.
        
             | steveBK123 wrote:
              | Precisely - it's bad code, which is what most code is!
        
               | handfuloflight wrote:
               | What's a Github repository with good code?
        
             | elzbardico wrote:
              | What the average programmer out there writes is just the
              | tip of the iceberg of what gets executed. And an incredibly
              | small tip.
        
         | deanCommie wrote:
          | The other problem with it is that there are other variables
          | involved besides pure speed - one of which is CONSISTENCY of
          | speed.
         | 
          | Folks are always surprised to see just how fast Java is able to
          | perform on benchmarks - it has a reputation as such a slow
          | language, so how come it's able to execute almost as fast as
          | C++?
         | 
         | But Java's problem isn't execution performance. It's:
         | 
         | * Startup performance. It takes time to instantiate the JVM.
         | That matters for some, not for others.
         | 
          | * Consistent performance. This is the nature of Garbage
          | Collection. It can surprise you when it runs and drop a
          | performance blip when you least expect it.
         | 
         | Most developers who think they need the fastest performance
         | actually need CONSISTENT performance more than the fastest.
        
           | PaulHoule wrote:
            | I even see some video games (say _Dome Keeper_) that
            | periodically go out to lunch even on a top-of-the-line gaming
            | PC, and I understand that kind of game is often written in
            | C#, which has a garbage collector. Thing is, I remember
            | playing games on the much weaker PS Vita that were written in
            | C#, but I never remember them pausing like that.
           | 
           | VR games are now the frontier of gaming (in terms of the
           | industry outdoing itself) and there you're approaching hard
           | real time requirements because it is so awful to get seasick.
           | I appreciate it.
        
             | bob1029 wrote:
             | I think it might have more to do with the specific
             | implementation than the tool.
             | 
             | Most incarnations of C# offer a GC.Collect method and an
             | ability to configure the threading model around
             | collections. Used properly, you can keep maximum delays
             | well-bounded. You still have to work your ass off to
             | minimize allocations, but when some do inevitably occur due
             | to framework internals, etc., you don't want to have to
             | clean up 3 hours worth of it at once. Do it every frame or
             | scene change.
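              | 
              | A rough analogy of that pattern in Python's stdlib gc
              | module (my sketch for illustration, not the C# API being
              | described):

```python
import gc

# Analogy to the C# pattern above, using Python's stdlib gc module:
# disable automatic collection, then collect at controlled points
# (e.g. once per frame) so the collection cost lands at predictable
# moments instead of surprising you mid-frame.
gc.disable()
try:
    for frame in range(3):
        # ... per-frame work that may allocate ...
        garbage = [object() for _ in range(1000)]
        del garbage
        gc.collect()  # pay the collection cost at a scheduled moment
finally:
    gc.enable()
```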
        
         | moralestapia wrote:
         | Nor does it pretend to be.
         | 
         | It's literally just "1 Billion nested loop iterations", methods
         | and results.
         | 
         | You make of that whatever you want.
        
       | xutopia wrote:
        | It blows my mind to see the speed difference between different
        | languages. Even in the slowest language there, we have lots of
        | ways we could optimize this contrived code, but this is not
        | something we can always do in the real world. Seeing the
       | visualization here really underscores how fast some languages
       | are.
        
         | swatcoder wrote:
         | > Seeing the visualization here really underscores how fast
         | some languages are.
         | 
         | Yes and no.
         | 
         | It emphasizes that there are some scenarios where language
         | choice can have an outsized effect on performance when running
         | on modern architectures, which is a reminder some people could
         | use or have never really witnessed.
         | 
         | But it's a very particular scenario that relatively few
         | projects will encounter at a scale that meaningfully impacts
         | them.
         | 
          | In particular, C and Rust are much better positioned to benefit
          | from instruction pipelining and avoid cache busting because
          | there are no conditionals and little-to-no out-of-loop code to
          | execute. Even with a massive loop like this, that can be a
          | real-world scenario in things like data/signal processing and
          | numerical computation. When you're doing those things, language
         | choice can make a huge difference that may not be apparent
         | without understanding how they work or at least seeing
         | visualizations like this.
         | 
         | But even in C and Rust, most applications don't have to do this
         | kind of thing. Most are dominated by smaller loops, with
         | branching that breaks pipelining, complexity that busts cpu
         | cache, calls and jumps that might interfere with both etc. In
         | those much more common cases, the practical performance
         | difference is going to be quite a lot less significant, and the
         | fluency/ergonomics/safety/convenience of working in some other
         | language can easily matter more.
        
           | PaulHoule wrote:
            | I like the extreme boost that PyPy gets here, which is much
            | more than you usually get with PyPy. On one hand it shows the
            | potential of PyPy, but it is testing just one area where PyPy
            | is strong; other workloads depend on performance that is not
            | as well optimized.
           | 
            | My take though is that branchy, old-AI, pointer-heavy code
            | (say, keeping every fact in a triple store and not as a
            | Python object, with the benefit of having an index for every
            | permutation of ?s-?p-?o) that cannot be handled with numpy or
            | similar tools benefits greatly from PyPy, but not 100x.
        
       | superjan wrote:
       | It's a billion times integer modulus, summed. Div/mod is by far
        | the slowest arithmetic instruction on a CPU, so it underestimates
        | C/Rust performance. I guess the author chose that to prevent the
       | C compiler from optimizing the loop away.
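        | 
        | A minimal Python sketch of the loop structure being described
        | here (a paraphrase for illustration, not the author's exact
        | code):

```python
def bench(outer: int, inner: int, u: int, r: int) -> int:
    """Shape of the benchmark: outer*inner iterations of `j % u`."""
    a = [0] * outer
    for i in range(outer):
        for j in range(inner):   # hot inner loop
            a[i] += j % u        # runtime modulus defeats constant folding
        a[i] += r                # random value, computed once up front
    return a[r % outer]

# The real benchmark uses outer=10_000, inner=100_000 (1B iterations),
# with u read from the command line; scaled down here for illustration.
print(bench(10, 100, 7, 3))  # → 298
```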
        
         | bddicken wrote:
         | Author here: Yeah, the mod was added to reduce risk of loop(s)
         | getting optimized away.
        
           | remcob wrote:
           | Did you verify this through disassembly? These loops can
           | still be replaced by a closed form expression and I wouldn't
           | be surprised if LLVM figured this out.
        
             | Dylan16807 wrote:
             | If it turned the inner loop into a closed form expression,
             | I would expect the outer loop to go through 10k iterations
             | a _lot_ faster than needing half a second.
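              | 
              | For reference, a sketch of the closed form the inner sum
              | could reduce to (assuming the inner loop computes the sum
              | of j % u over j in range(n)):

```python
def mod_sum_closed_form(n: int, u: int) -> int:
    # sum(j % u for j in range(n)) without a loop: `full` complete
    # cycles of 0..u-1 each sum to u*(u-1)/2, plus a partial cycle
    # of 0..rem-1 summing to rem*(rem-1)/2.
    full, rem = divmod(n, u)
    return full * u * (u - 1) // 2 + rem * (rem - 1) // 2

# Matches the naive inner loop:
assert mod_sum_closed_form(100_000, 7) == sum(j % 7 for j in range(100_000))
```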
        
       | fbn79 wrote:
       | Would be nice to know max RAM usage by each implementation.
        
       | maxmcd wrote:
       | Another fave of mine by this person: https://benjdd.com/az/
        
       | o11c wrote:
       | Note that these benchmarks ignore object-related stuff, which is
       | usually the expensive part.
        
       | JohnMakin wrote:
       | Looking at the source code, is the random value generated before
       | the main loop iterations to stop the compiler from shortcutting
       | all the iterations? Does that work on some of the better
       | compilers out there? The random value is only generated once. I
       | would like to see a similar benchmark on creating/destroying
       | objects in memory.
        
         | neonsunset wrote:
         | If you are interested in looking into the details of
         | microbenchmarks of this type, there is always
         | https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
          | which has a clearly defined methodology and lets you understand
          | wall-clock time vs CPU cost too.
         | 
         | The benchmark code linked in the submission unfortunately
         | suffers from the mistakes that happen whenever you write a
         | benchmark for the first time. The visuals are pretty but might
         | as well have been based on something more comprehensive.
         | 
         | (also, please stop measuring performance with loops that
         | perform no work and fibonacci sequences, this was bad 30 years
         | ago, this is bad today too)
        
           | JohnMakin wrote:
           | Cool, thanks! In another lifetime I had an internship at an
           | embedded software company which was to write tests and
           | optimize compute or throughput for various chips and chip
           | firmware, and reading the functions here sparked some alarm
           | bells in my head. Still a fun idea though and cool graphics.
        
         | swatcoder wrote:
         | Yeah, the random value prevents compile-time pre-calculation
         | and the print makes sure the loops actually get run and values
         | actually get calculated (as modern compilers are wont to simply
         | omit code that clearly doesn't have consequence).
        
       | fifilura wrote:
       | Doing Python or R right means using vector/matrix operations in
       | e.g. numpy. Those will be faster.
       | 
       | Just think twice before using a for loop. Most of the time it is
       | not needed.
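        | 
        | As a sketch (assuming numpy is available), the benchmark's
        | 100k-iteration inner loop collapses to a single vectorized
        | expression:

```python
import numpy as np

u = 7  # stand-in for the benchmark's runtime divisor
j = np.arange(100_000)
inner_sum = int((j % u).sum())  # one vectorized op replaces the inner loop

# Same result as the pure-Python loop:
assert inner_sum == sum(k % u for k in range(100_000))
```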
        
         | MooseBurger wrote:
         | I disagree w.r.t. python. You shouldn't need a numerical
         | computing library to do a for loop generally.
        
           | fifilura wrote:
            | Kick the habit. For loops are what "goto" was 20 years ago:
            | error-prone and not possible to abstract, for example for
            | parallel execution.
        
       | iambvk wrote:
        | Golang folks: after looking at the code, I am surprised and don't
        | understand why the Go version is slow compared to C/C++/Rust. Can
        | anyone please explain? Thank you.
        
         | bddicken wrote:
          | Author here: Yeah, I was surprised that there don't seem to be
          | many options for extra optimizations in the Go compiler.
         | Would be curious if Go experts have more insight or other
         | compiler recommendations.
        
           | boyter wrote:
            | I doubt it's the GC kicking in, but you could run it with the
            | following environment variable set just to ensure that it's
            | not doing anything.
            | 
            |     GOGC=-1
           | 
            | EDIT: Trying it out quickly shows a small improvement
            | actually, but so small as to likely be noise as I was doing
            | other things on the machine.
            | 
            |     Summary
            |       ./mainnogc 1000 ran
            |         1.01 +- 0.06 times faster than ./maingc 1000
        
       | mcdeltat wrote:
       | You have to be extremely careful with benchmarks like these. It's
       | very easy to measure and conclude different things. In this case,
       | it really only measures the compiler's ability to optimise the
       | `for` loop construct - nothing else. It doesn't say anything much
       | about the languages' general "speed", whatever that means (which
       | is generally the implication with tests like this).
       | 
       | Some points to consider:
       | 
       | - `for` is not how you'd write an optimised loop in all
       | languages.
       | 
       | - Be VERY suspicious of microbenchmarks on C, C++, and similar -
       | the compilers are smarter than you think and you won't get the
       | code you wrote.
       | 
       | - Languages do arrays differently - added memory effects on
       | performance.
       | 
       | - Languages do numbers differently.
       | 
        | - Summing and modulo is a somewhat contrived example. Modulo is
        | one of the slower instructions out there.
        
       | pbw wrote:
       | I wrote this to probe the question of why "slow" languages are
        | popular and valid when used in conjunction with fast ones:
       | https://tobeva.com/articles/beyond/
        
       | obituary_latte wrote:
       | The animation is confusing...is one trip to the end and back the
       | 1B iterations? I thought maybe they were counting up and once the
       | slowest reached the end they would all stop but that didn't seem
       | to be it. Maybe if it did (and there was a counter on each
       | showing how many bounces they made), it'd be a little more clear.
       | 
       | Interesting nonetheless.
        
         | ayaros wrote:
          | Looks like they are just bouncing back and forth at speeds
          | relative to each other; the speed of each ball is all that
          | matters. I also found this diagram confusing at first.
        
       | wanderingmind wrote:
        | How in the world is PyPy faster than CPython by almost 2 orders
        | of magnitude? Is PyPy normally faster in other benchmarks? If so,
        | why? Can some Python expert pitch in?
        
         | lights0123 wrote:
         | PyPy uses just-in-time compilation, so for arithmetic-heavy
         | computations not involving the GC, it should absolutely be
         | faster.
        
       | ayaros wrote:
       | I'm working on a web app in JS, the sort of thing I think people
       | on here will enjoy and I'll definitely share it when it's
       | presentable. It's all in JS, including the graphical aspects of
       | it (I'm drawing to a canvas). For per-pixel rendering I have 4
       | levels of nested loops. I feel like I've done everything in my
       | power to optimize the process, and it's still not at the ideal
       | speed. Really, it ought to be about 5-10 times faster than it is
       | now.
       | 
       | I've been thinking of getting it into WASM somehow - it's really
       | just one function that needs to be replaced by a WASM module -
       | but one of the things which has given me pause are benchmarks
       | like this, as well as other anecdotes, that demonstrate I'd only
       | get a minor speedup from WASM. I guess part of this is due to
       | modern JS engines being so optimized.
       | 
        | Just about all of the relevant code is adding and subtracting
        | integers, but there's some Boolean logic. There's also some
       | multiplication in there that I could eliminate if I really had
       | to, although at this point I'm not certain it would make a
       | significant difference. Most of the data that contributes to the
       | value of each pixel is spread across a tree of objects, and
       | there's some polymorphism involved too. The point is it's not
       | like I'm just dealing with a bunch of contiguous data in memory
       | that is easily cached by the CPU, which might speed things up
       | considerably. I guess I'm just venting now, lamenting a nightmare
       | of my own creation...
        
       | eiathom wrote:
       | It's interesting to imagine the head scratching going on in this
       | thread. All these clever people, stumped. 'How can this be!???'.
       | Quite the post, bravo OP.
       | 
       | But seriously, other comments have this nailed. This 'benchmark'
       | isn't comparing apples with apples. There are too many quirks,
        | optimisations and design decisions (of a compiler) that have not
        | been taken into account.
       | 
       | Also, as others have repeated, this isn't indicative of actual
       | usage - real world usage. In that regard, this type of
       | 'benchmark' isn't scientific at all, and worse, is misleading a
       | naive reader.
        
         | Dylan16807 wrote:
         | While it's not (necessarily) indicative of the languages as a
         | whole, it looks pretty apples to apples to me.
         | 
         | These are very simple programs with no fancy tricks going on.
        
       | sysread wrote:
        | I tried this with elixir, and on my m3 mac, it ran in 6.5s. Perl
        | took over 40s :P
        | 
        |     #!/usr/bin/env elixir
        |     u =
        |       case System.argv() do
        |         [arg] -> String.to_integer(arg)
        |         _ -> raise "Please provide a valid integer as the first argument."
        |       end
        | 
        |     r = :rand.uniform(10_000) - 1
        |     a = :array.new(10_000, default: 0)
        | 
        |     a =
        |       Enum.reduce(0..9_999, a, fn i, acc ->
        |         inner_sum = Enum.reduce(0..99_999, 0, fn j, sum -> sum + rem(j, u) end)
        |         updated_value = :array.get(i, acc) + inner_sum + r
        |         :array.set(i, updated_value, acc)
        |       end)
        | 
        |     IO.puts(:array.get(r, a))
        
         | tmtvl wrote:
         | Oh, the programs are supposed to take an integer from the
         | command line? Then I wonder how many of them would fail when
         | given an integer that's just a wee bit too large (like, say
         | (2^129)-1).
        
       ___________________________________________________________________
       (page generated 2024-11-26 23:02 UTC)