[HN Gopher] One Billion Nested Loop Iterations
___________________________________________________________________
One Billion Nested Loop Iterations
Author : behnamoh
Score : 51 points
Date : 2024-11-26 18:17 UTC (4 hours ago)
(HTM) web link (benjdd.com)
(TXT) w3m dump (benjdd.com)
| davikr wrote:
| This does not seem like a realistic benchmark target.
| mrkeen wrote:
| Perhaps it's unrealistic in the sense that this is
| unrealistically-optimisable code.
|
| Real-world code has closures, allocations, and pointer-
| dereferencing. This code is just iterations, additions and
| modulus.
| behnamoh wrote:
| > This code is just iterations...
|
| Which is what the average programmer out there writes.
| Therefore, I think this is actually a good realistic
| benchmark.
| steveBK123 wrote:
| Precisely - it's bad code, which is what most code is!
| handfuloflight wrote:
| What's a Github repository with good code?
| elzbardico wrote:
| What the average programmer out there writes is just the tip
| of the iceberg of what gets executed. And an incredibly small
| tip.
| deanCommie wrote:
| The other problem with it is that there are other variables
| involved besides pure speed - such as CONSISTENCY of speed.
|
| Folks are always surprised to see just how fast Java performs
| on benchmarks - it has a reputation as such a slow language, so
| how come it's able to execute almost as fast as C++?
|
| But Java's problem isn't execution performance. It's:
|
| * Startup performance. It takes time to instantiate the JVM.
| That matters for some, not for others.
|
| * Consistent performance. This is the nature of Garbage
| Collection. It can surprise you, kicking in and causing a
| performance blip when you least expect it.
|
| Most developers who think they need the fastest performance
| actually need CONSISTENT performance more than the fastest.
| PaulHoule wrote:
| I even see some video games (say _Dome Keeper_) that
| periodically go out to lunch even on a top-of-the-line gaming
| PC, and I understand that kind of game is often written in C#,
| which has a garbage collector. Thing is, I remember playing
| games on the much weaker PS Vita that were written in C#, but
| I never remember them pausing like that.
|
| VR games are now the frontier of gaming (in terms of the
| industry outdoing itself), and there you're approaching hard
| real-time requirements because it is so awful to get seasick.
| I appreciate it.
| bob1029 wrote:
| I think it might have more to do with the specific
| implementation than the tool.
|
| Most incarnations of C# offer a GC.Collect method and an
| ability to configure the threading model around
| collections. Used properly, you can keep maximum delays
| well-bounded. You still have to work your ass off to
| minimize allocations, but when some do inevitably occur due
| to framework internals, etc., you don't want to have to
| clean up 3 hours' worth of it at once. Do it every frame or
| scene change.
| moralestapia wrote:
| Nor does it pretend to be.
|
| It's literally just "1 Billion nested loop iterations", methods
| and results.
|
| You make of that whatever you want.
| xutopia wrote:
| It blows my mind to see that speed difference between different
| languages. In even the slowest language there we have lots of
| ways we could optimize this contrived code but this is not
| something that we can always do in the real world. Seeing the
| visualization here really underscores how fast some languages
| are.
| swatcoder wrote:
| > Seeing the visualization here really underscores how fast
| some languages are.
|
| Yes and no.
|
| It emphasizes that there are some scenarios where language
| choice can have an outsized effect on performance when running
| on modern architectures, which is a reminder some people could
| use or have never really witnessed.
|
| But it's a very particular scenario that relatively few
| projects will encounter at a scale that meaningfully impacts
| them.
|
| In particular, C and Rust are much better positioned to exploit
| instruction pipelining and avoid cache busting, because there
| are no conditionals and little-to-no out-of-loop code to
| execute. Even with a massive loop like this, that can be a real
| world scenario in things like data/signal processing and
| numerical computation. When you're doing those things, language
| choice can make a huge difference that may not be apparent
| without understanding how they work or at least seeing
| visualizations like this.
|
| But even in C and Rust, most applications don't have to do this
| kind of thing. Most are dominated by smaller loops, with
| branching that breaks pipelining, complexity that busts CPU
| cache, and calls and jumps that might interfere with both. In
| those much more common cases, the practical performance
| difference is going to be quite a lot less significant, and the
| fluency/ergonomics/safety/convenience of working in some other
| language can easily matter more.
| PaulHoule wrote:
| I like the extreme boost that PyPy gets here, which is much
| more than you usually get with PyPy. On one hand it shows the
| potential of PyPy, but it is testing just one area where PyPy
| is strong; other workloads depend on performance that is not
| as well optimized.
|
| My take, though, is that branchy, old-AI, heavy code (say,
| keeping every fact in a triple store and not as a Python object,
| with the benefit of having an index for every permutation of
| ?s-?p-?o) that cannot be handled with numpy or similar tools
| benefits greatly from PyPy, but not 100x.
| superjan wrote:
| It's a billion iterations of integer modulus, summed. Div/mod is
| by far the slowest arithmetic instruction on a CPU, so this
| underestimates C/Rust performance. I guess the author chose it to
| prevent the C compiler from optimizing the loop away.
| bddicken wrote:
| Author here: Yeah, the mod was added to reduce risk of loop(s)
| getting optimized away.
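For concreteness, here is a minimal Python sketch of the benchmark's shape as described in this thread (10,000 outer iterations times 100,000 inner iterations, summing `j % u` with `u` taken from the command line); the loop sizes match the Elixir port further down, but the `bench` name and exact structure are illustrative, not the author's exact code:

```python
import random

def bench(u: int, outer: int = 10_000, inner: int = 100_000) -> int:
    """Sketch of the benchmark: `outer` * `inner` loop iterations
    (the defaults give 1 billion), summing j % u into an array
    slot. A random value chosen once up front is mixed in so the
    result cannot be precomputed at compile time."""
    r = random.randrange(outer)   # random index, picked once
    a = [0] * outer
    for i in range(outer):
        for j in range(inner):
            a[i] += j % u         # the modulus keeps the loop "live"
        a[i] += r
    return a[r]                   # using the result defeats dead-code elimination
```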
| remcob wrote:
| Did you verify this through disassembly? These loops can
| still be replaced by a closed form expression and I wouldn't
| be surprised if LLVM figured this out.
| Dylan16807 wrote:
| If it turned the inner loop into a closed form expression,
| I would expect the outer loop to go through 10k iterations
| a _lot_ faster than needing half a second.
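The closed form remcob alludes to does exist: `j % u` cycles through 0..u-1, so the inner sum can be computed in O(1). A Python sketch of the transformation (an illustration only, not a claim about what LLVM actually emits):

```python
def mod_sum_loop(n: int, u: int) -> int:
    # Direct O(n) loop, shaped like the benchmark's inner loop.
    return sum(j % u for j in range(n))

def mod_sum_closed(n: int, u: int) -> int:
    # O(1) closed form: j % u repeats the cycle 0, 1, ..., u-1.
    q, rm = divmod(n, u)              # q full cycles + rm leftover terms
    full = q * (u * (u - 1) // 2)     # each full cycle sums 0+1+...+(u-1)
    tail = rm * (rm - 1) // 2         # partial cycle sums 0+1+...+(rm-1)
    return full + tail
```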
| fbn79 wrote:
| Would be nice to know max RAM usage by each implementation.
| maxmcd wrote:
| Another fave of mine by this person: https://benjdd.com/az/
| o11c wrote:
| Note that these benchmarks ignore object-related stuff, which is
| usually the expensive part.
| JohnMakin wrote:
| Looking at the source code, is the random value generated before
| the main loop iterations to stop the compiler from shortcutting
| all the iterations? Does that work on some of the better
| compilers out there? The random value is only generated once. I
| would like to see a similar benchmark on creating/destroying
| objects in memory.
| neonsunset wrote:
| If you are interested in looking into the details of
| microbenchmarks of this type, there is always
| https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
| which has a clearly defined methodology and lets you understand
| wall-clock time vs CPU cost too.
|
| The benchmark code linked in the submission unfortunately
| suffers from the mistakes that happen whenever you write a
| benchmark for the first time. The visuals are pretty but might
| as well have been based on something more comprehensive.
|
| (also, please stop measuring performance with loops that
| perform no work and with Fibonacci sequences; this was bad 30
| years ago and it is bad today too)
| JohnMakin wrote:
| Cool, thanks! In another lifetime I had an internship at an
| embedded software company which was to write tests and
| optimize compute or throughput for various chips and chip
| firmware, and reading the functions here sparked some alarm
| bells in my head. Still a fun idea though and cool graphics.
| swatcoder wrote:
| Yeah, the random value prevents compile-time pre-calculation
| and the print makes sure the loops actually get run and values
| actually get calculated (as modern compilers are wont to simply
| omit code that clearly doesn't have consequence).
| fifilura wrote:
| Doing Python or R right means using vector/matrix operations in
| e.g. numpy. Those will be faster.
|
| Just think twice before using a for loop. Most of the time it is
| not needed.
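As an illustration of the point above, a sketch of the benchmark's inner sum vectorized with numpy (assuming numpy is available; the function name is illustrative):

```python
import numpy as np

def mod_sum_numpy(n: int, u: int) -> int:
    # Vectorized equivalent of sum(j % u for j in range(n)):
    # one array allocation plus element-wise ops executed in C,
    # instead of n trips through the Python interpreter loop.
    j = np.arange(n, dtype=np.int64)
    return int((j % u).sum())
```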
| MooseBurger wrote:
| I disagree w.r.t. python. You shouldn't need a numerical
| computing library to do a for loop generally.
| fifilura wrote:
| Kick the habit. For loops are what "goto" was 20 years ago:
| error prone and impossible to abstract, for example for
| parallel execution.
| iambvk wrote:
| Golang folks, after looking at the code, I am surprised and don't
| understand why Go version is slow compared to C/C++/Rust. Can
| anyone please explain? Thank you.
| bddicken wrote:
| Author here: Yeah, I was surprised that there doesn't seem to
| be many options for extra optimizations in the Go compiler.
| Would be curious if Go experts have more insight or other
| compiler recommendations.
| boyter wrote:
| I doubt it's the GC kicking in, but you could run it with the
| following environment variable set just to ensure that it's
| not doing anything: GOGC=-1
|
| EDIT: Trying it out quickly shows a small improvement
| actually, but so small as to likely be noise as I was doing
| other things on the machine.
|
|     Summary
|       ./mainnogc 1000 ran
|         1.01 ± 0.06 times faster than ./maingc 1000
| mcdeltat wrote:
| You have to be extremely careful with benchmarks like these. It's
| very easy to measure and conclude different things. In this case,
| it really only measures the compiler's ability to optimise the
| `for` loop construct - nothing else. It doesn't say much
| about the languages' general "speed", whatever that means (which
| is generally the implication with tests like this).
|
| Some points to consider:
|
| - `for` is not how you'd write an optimised loop in all
| languages.
|
| - Be VERY suspicious of microbenchmarks on C, C++, and similar -
| the compilers are smarter than you think and you won't get the
| code you wrote.
|
| - Languages do arrays differently - added memory effects on
| performance.
|
| - Languages do numbers differently.
|
| - Summing and modulo is a somewhat contrived example. Modulo is
| one of the slower instructions out there.
| pbw wrote:
| I wrote this to probe the question of why "slow" languages are
| popular and valid when used in conjunction with fast ones:
| https://tobeva.com/articles/beyond/
| obituary_latte wrote:
| The animation is confusing...is one trip to the end and back the
| 1B iterations? I thought maybe they were counting up and once the
| slowest reached the end they would all stop but that didn't seem
| to be it. Maybe if it did (and there was a counter on each
| showing how many bounces they made), it'd be a little more clear.
|
| Interesting nonetheless.
| ayaros wrote:
| Looks like they are just bouncing back and forth at relative
| speeds to each other; the speed of each ball is all that
| matters. I also found this diagram confusing at first.
| wanderingmind wrote:
| How in the world is PyPy faster than CPython by almost 2 orders
| of magnitude? Is PyPy normally faster in other benchmarks? If
| so, why? Can some Python expert pitch in?
| lights0123 wrote:
| PyPy uses just-in-time compilation, so for arithmetic-heavy
| computations not involving the GC, it should absolutely be
| faster.
| ayaros wrote:
| I'm working on a web app in JS, the sort of thing I think people
| on here will enjoy and I'll definitely share it when it's
| presentable. It's all in JS, including the graphical aspects of
| it (I'm drawing to a canvas). For per-pixel rendering I have 4
| levels of nested loops. I feel like I've done everything in my
| power to optimize the process, and it's still not at the ideal
| speed. Really, it ought to be about 5-10 times faster than it is
| now.
|
| I've been thinking of getting it into WASM somehow - it's really
| just one function that needs to be replaced by a WASM module -
| but one of the things which has given me pause are benchmarks
| like this, as well as other anecdotes, that demonstrate I'd only
| get a minor speedup from WASM. I guess part of this is due to
| modern JS engines being so optimized.
|
| Just about all of the relevant code is adding and subtracting
| integers, but there's some Boolean logic. There's also some
| multiplication in there that I could eliminate if I really had
| to, although at this point I'm not certain it would make a
| significant difference. Most of the data that contributes to the
| value of each pixel is spread across a tree of objects, and
| there's some polymorphism involved too. The point is it's not
| like I'm just dealing with a bunch of contiguous data in memory
| that is easily cached by the CPU, which might speed things up
| considerably. I guess I'm just venting now, lamenting a nightmare
| of my own creation...
| eiathom wrote:
| It's interesting to imagine the head scratching going on in this
| thread. All these clever people, stumped. 'How can this be!???'.
| Quite the post, bravo OP.
|
| But seriously, other comments have this nailed. This 'benchmark'
| isn't comparing apples with apples. There are too many quirks,
| optimisations and design decisions (of a compiler) that have not
| been taken into account.
|
| Also, as others have repeated, this isn't indicative of actual
| usage - real world usage. In that regard, this type of
| 'benchmark' isn't scientific at all, and worse, is misleading a
| naive reader.
| Dylan16807 wrote:
| While it's not (necessarily) indicative of the languages as a
| whole, it looks pretty apples to apples to me.
|
| These are very simple programs with no fancy tricks going on.
| sysread wrote:
| I tried this with elixir, and on my m3 mac, it ran in 6.5s. Perl
| took over 40s :P
|
|     #!/usr/bin/env elixir
|     u =
|       case System.argv() do
|         [arg] -> String.to_integer(arg)
|         _ -> raise "Please provide a valid integer as the first argument."
|       end
|
|     r = :rand.uniform(10_000) - 1
|     a = :array.new(10_000, default: 0)
|
|     a =
|       Enum.reduce(0..9_999, a, fn i, acc ->
|         inner_sum = Enum.reduce(0..99_999, 0, fn j, sum -> sum + rem(j, u) end)
|         updated_value = :array.get(i, acc) + inner_sum + r
|         :array.set(i, updated_value, acc)
|       end)
|
|     IO.puts(:array.get(r, a))
| tmtvl wrote:
| Oh, the programs are supposed to take an integer from the
| command line? Then I wonder how many of them would fail when
| given an integer that's just a wee bit too large (like, say
| (2^129)-1).
___________________________________________________________________
(page generated 2024-11-26 23:02 UTC)