[HN Gopher] Optimizing the Particle Life: From 400 to 4M particles
       ___________________________________________________________________
        
       Optimizing the Particle Life: From 400 to 4M particles
        
       Author : caranea
       Score  : 115 points
       Date   : 2024-03-18 15:32 UTC (7 hours ago)
        
 (HTM) web link (programmingattack.com)
 (TXT) w3m dump (programmingattack.com)
        
       | hyperbrainer wrote:
       | Why was C slower though?
        
         | KeplerBoy wrote:
         | No optimization flags could be a big part of the reason.
         | 
         | Haven't looked closely at the code or tried it, but with -O3,
         | -fopenmp and a well-placed pragma the performance could
         | increase many-fold.
         | 
         | Heck, with NVC++ you could offload that thing to a GPU with
         | minimal effort and have it flying at the memory bandwidth
         | limit.
        
         | wzdd wrote:
         | Simply compiling it with -O3 produces something which completes
         | in half the time of the JavaScript version (350ms for C, 750ms
         | for JS), so perhaps that.
         | 
         | Edit for Twirrim: on this system (Ryzen 7, gcc 11): "-O3":
         | 350ms; "-O3 -march=native": 208ms; "-O2": 998ms; "-O2
         | -march=native": 1040ms.
         | 
         | Edit 2: Interestingly, changing the C from float to double
         | produces a 3.5x speedup, taking the time elapsed (with "-O3
         | -march=native") to 58ms, or about 12x faster than JS. This also
         | makes what it's computing closer to the JavaScript version.
        
           | anon946 wrote:
           | Did you add output of the results to the C code?
        
             | wzdd wrote:
             | I did not, but I confirmed with objdump that my compiler
             | was not removing the code.
             | 
             | (But to be sure, I just ran it again with an output and got
             | the same value.)
        
           | Twirrim wrote:
           | For fun and frolics:
           | 
            | No flags: 1843 ms
           | 
           | -march=native: 2183 ms
           | 
           | -O2: 423 ms
           | 
           | -O2 -march=native: 250 ms
           | 
           | -O3: 425 ms
           | 
           | -O3 -march=native: 255 ms
           | 
           | O3 doesn't seem to be helping in my case.
        
             | Tadpole9181 wrote:
             | They didn't use _any_ optimization flags.
        
               | Twirrim wrote:
               | Yeah, I was just trying to show the difference. Doing it
               | without optimisation flags is an utterly bewildering
               | decision by the author.
        
           | jasonjmcghee wrote:
           | Seems crazy to me that double would produce that kind of
           | speed up. Is float getting emulated somehow? Don't they end
           | up the same size?
        
         | pvg wrote:
         | Author mentions they didn't use optimization flags but doesn't
         | include the compilation details. You can sort of guess that
         | (relatively) unoptimized C might perform worse than V8's JIT on
         | short, straight computational code - you're more or less
         | testing two native code generators doing a simple thing except
         | one has more optimizations enabled and wins.
        
           | hyperbrainer wrote:
           | Oh, I assumed he still did -O2, and did not do anything else.
           | Is that bad to assume?
           | 
           | PS: I do not use C beyond reading some of its code for
           | inspiration, so kinda unaware
        
             | pvg wrote:
             | It just says 'no optimization flags' and that's that, I
             | don't think even the compiler is mentioned so the author is
             | not giving you a lot to go on here - you don't even know
             | what the default is.
             | 
             | A modern C optimizing compiler is going to go absolutely
             | HAM on this sort of code snippet if you tell it to and do
             | next to nothing if you don't - that's, roughly, the
             | explanation for this little discrepancy.
        
       | anon946 wrote:
       | IMO, not using any optimization flags with C is somewhat
       | arbitrary, since the compiler writers could have just decided
       | that by default we'll do thing X, Y, and Z, and then you'd need
       | to turn them off explicitly.
       | 
       | FWIW, without -O, with -O, and with -O4, I get 2500ms, 1500ms,
       | and 550ms respectively. I didn't bother to look at the .S to see
       | the code improvements. (Of course, I edited the code to output
       | the results, otherwise, it just optimized out everything.)
        
         | hyperbrainer wrote:
         | > (Of course, I edited the code to output the results,
         | otherwise, it just optimized out everything.)
         | 
         | O(1) :)
        
         | caranea wrote:
         | Thanks for posting your results!
         | 
          | Since I was already set on writing an in-browser particle
          | life, I didn't benchmark the C code with different flags.
        
           | anon946 wrote:
           | Completely reasonable. I'd probably edit your blog post a bit
           | to indicate that.
        
         | TylerE wrote:
         | Should also test -Os when doing this sort of thing. Sometimes
         | the reduced size greatly improves cache behavior, and even when
         | not it's often outright competitive with at least -O2 anyway
         | (usually compiles faster too!)
        
         | abainbridge wrote:
         | One optimization for the C code is to put "f" suffixes on the
         | floating point constants. For example convert this line:
         | t[i] += 0.02 * (float)j;
         | 
         | to:                   t[i] += 0.02f * (float)j;
         | 
         | I believe this helps because 0.02 is a double and doing double
         | * float and then converting the result to float can produce a
         | different answer to just doing float * float. The compiler has
         | to do the slow version because that's what you asked for.
         | 
         | Adding the -ffast-math switch appears to make no difference.
         | I'm never sure what -ffast-math does exactly.
         | 
         | Minimal case on Godbolt:
         | 
         | https://godbolt.org/z/W18YsnMY5 - without the f
         | 
         | https://godbolt.org/z/oc1s8WKeG - with the f
        
           | a1369209993 wrote:
           | > I believe this helps because 0.02 is a double and [...] can
           | produce a different answer
           | 
            | In principle, not quite. The real, unavoidable-by-the-
            | compiler problem is that 0.02 is not a dyadic rational
            | (not representable exactly as some integer over a power
            | of two). So its representation (rounded to 52 bits) as a
            | double is a different real number than its representation
            | (rounded to 23 bits) as a float. (This is the same
            | problem as rounding pi or e to a double/float, but people
            | tend to forget that it applies to all non-dyadic
            | rationals, not just irrationals.)
           | 
            | If, instead of `0.02f`, you replaced `0.02` with
            | `(double)0.02f` or `0.015625`, the optimization should in
            | theory still apply (although missed-optimization compiler
            | bugs are of course possible).
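(The two roundings can be seen directly from Python, whose floats are IEEE-754 doubles; a small sketch using `struct` to round to single precision:)

```python
import struct

def as_float32(x):
    # round a Python float (a C double) to the nearest IEEE-754 single
    return struct.unpack('f', struct.pack('f', x))[0]

d = 0.02               # 0.02 rounded to double precision
f = as_float32(0.02)   # 0.02 rounded to single precision, widened back

print(f"{d:.25f}")     # the double lands slightly above 1/50
print(f"{f:.25f}")     # the float lands slightly below 1/50
print(d == f)          # False: two different real numbers

# A dyadic rational like 1/64 survives both roundings exactly:
print(as_float32(0.015625) == 0.015625)  # True
```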
        
             | abainbridge wrote:
             | OK, right, that's the clarity of thought I was missing.
             | 
             | But in this case the compiler still misses the optimization
             | with '(double)0.02f'. https://godbolt.org/z/az7819nKM
             | 
             | I think this is because the optimization isn't safe. I
             | wrote a program to find a counter example to your claim
             | that "the optimization should in theory still apply". It
              | found one. Here's the code:
              | 
              |     #include <stdio.h>
              |     #include <stdlib.h>
              | 
              |     float mul_as_float(float t) {
              |         t += 0.02f * (float)17;
              |         return t;
              |     }
              | 
              |     float mul_as_double(float t) {
              |         t += (double)0.02f * (float)17;
              |         return t;
              |     }
              | 
              |     int main() {
              |         while (1) {
              |             unsigned r = rand();
              |             float t = *((float*)&r);
              |             float result1 = mul_as_float(t);
              |             float result2 = mul_as_double(t);
              |             if (result1 != result2) {
              |                 printf("Counter example when t is %f (0x%x)\n",
              |                        t, *((unsigned*)&t));
              |                 printf("result1 is %f (0x%x)\n",
              |                        result1, *((unsigned*)&result1));
              |                 printf("result2 is %f (0x%x)\n",
              |                        result2, *((unsigned*)&result2));
              |                 return 0;
              |             }
              |         }
              |     }
             | 
              | It outputs:
              | 
              |     Counter example when t is 0.000000 (0x3477d43f)
              |     result1 is 0.340000 (0x3eae1483)
              |     result2 is 0.340000 (0x3eae1482)
             | 
             | What do you think?
        
               | a1369209993 wrote:
                | On my machine, the compiler constant-folds the
                | multiplication, producing a single-precision add for
                | `mul_as_float`, and a convert-t-to-double, double-
                | precision add, convert-sum-to-single sequence for
                | `mul_as_double`. I missed the `+=` in your original
                | comment, but adding a float to a double does
                | implicitly promote it like that, so you'd actually
                | need:
                | 
                |     t += (float)((double)0.02f * (float)17);
                | 
                | to achieve the "and then converting the result [of
                | the multiplication] to float" (rather than keeping it
                | a double for the addition) from your original
                | comment. (With the above line in mul_as_double, your
                | test code no longer finds a counterexample, at least
                | when I ran it.)
                | 
                | If you ask for higher-precision intermediates, even
                | implicitly, floating-point compilers will typically
                | give them to you, hoped-for efficiency of single
                | precision be damned.
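(That double-rounding can be cross-checked outside C. A sketch in Python replaying both expressions, with `struct` doing the float32 rounding; the input bit pattern is the one reported above:)

```python
import struct

def f32(x):
    # round a Python float (a double) to the nearest IEEE-754 single
    return struct.unpack('f', struct.pack('f', x))[0]

def bits(x):
    # the raw bit pattern of x as a single-precision float
    return struct.unpack('I', struct.pack('f', x))[0]

t = struct.unpack('f', struct.pack('I', 0x3477D43F))[0]  # reported input
c = f32(0.02)  # the value of the literal 0.02f

# mul_as_float: 0.02f * 17 folds to a float constant, then a float add
result1 = f32(f32(c * 17.0) + t)
# mul_as_double: (double)0.02f * 17 in double, double add, float on return
result2 = f32(c * 17.0 + t)

print(hex(bits(result1)))  # 0x3eae1483
print(hex(bits(result2)))  # 0x3eae1482
```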
        
               | abainbridge wrote:
               | Ah right, yep.
        
       | CyberDildonics wrote:
        | This is fairly trivial stuff with underwhelming results. 4
        | million particles isn't much on modern hardware, even with
        | spatial partitioning. The hardest part of spatial
        | partitioning is probably the underlying one-dimensional
        | partition algorithm, which can be found here:
       | 
       | https://en.cppreference.com/w/cpp/algorithm/partition
        
         | iainmerrick wrote:
         | It's not rocket science, but "trivial" is very harsh. This
         | stuff is fiddly to get right. The author (who I assume is a
         | student) seems to have done a good job and it's a nice writeup.
        
           | CyberDildonics wrote:
           | I guess you could argue that, but it's strange that people
           | want to give a student's first project (with a lot of huge
           | mistakes like thinking node is going to outperform C because
           | they didn't turn optimizations on) so much interest. I think
           | people are taken in by big numbers and assume there is
           | something cutting edge because they don't know any better.
           | 
           | The results are just some particles in an extremely basic
           | pattern, there isn't a lot of payoff.
        
             | caranea wrote:
              | One point needs correcting though - nowhere in the
              | article do I say that node outperforms C. On the
              | contrary - I stressed that I didn't use any flags so
              | that people wouldn't conclude that C is always slower,
              | and I explicitly said not to get fixated on such
              | benchmarks.
             | 
             | What it was meant to show is that V8 is *good enough* to
             | even consider it for the job :)
        
               | CyberDildonics wrote:
                | You benchmarked three different programs, and one was
                | compiled to be slow. Why give times for a debug build
                | when the other two aren't? Showing performance
                | metrics that are apples to oranges, with times for
                | something not built to run fast, is total nonsense.
                | 
                | Do you really not think this is a mistake? Most
                | people would see C getting outperformed and realize
                | there is something wrong. I'm shocked anyone would do
                | this and then try to rationalize it after.
        
       | atulvi wrote:
        | 4M particles is not the question. What's the framerate?
        
         | bee_rider wrote:
         | You can go to the "next" story at the bottom of the post, and
         | then the author has some links to an online implementation.
        
         | jasonjmcghee wrote:
         | Would vary significantly based on the GPU, I would assume.
        
       | Jupe wrote:
       | This is pretty neat...
       | 
       | Having an interest in the space for many years (on-and-off; life
       | gets busy sometimes) - I've often lamented the lack of good
       | pixel-level performance optimizations in graphics cards.
       | Everything seems hell-bent on polygons.
       | 
       | Some years ago, DirectDraw on Windows was an excellent way to
       | optimize the graphics portion of these types of things, but that
       | all went away as "3D Is The Way" mentality took over.
       | 
       | My explorations of Processing.* were neat, but lacked the IDE and
       | library support I like as a developer.
        
         | pixelpoet wrote:
         | There's no lack of "pixel-level performance optimisations" on
         | GPUs, and just because basically the whole world seems to think
         | that graphics programming == using some rasterisation library
         | like OpenGL or DX or Vulkan or whatever for realtime
         | applications, doesn't mean you're forced to use it or that
         | that's all there is.
         | 
         | Just fire up OpenCL (or CUDA, if you must) and start executing
         | some data-parallel kernels.
        
           | Jupe wrote:
           | That's all a little beyond me at this point. But very
           | compelling!
           | 
           | As the original simulation shows, the idea is to calculate
           | the difference and proximity of the pixels. So the drawing
           | part is just a projection...
           | 
            | Getting a memory buffer pointer and setting pixels
            | within my calculation loop with simple offsets was very
            | simple and compelling. I think I should look deeper into
            | OpenCL/CUDA for this use-case, as I mentioned in the
            | other response to my comment.
           | 
           | Thanks!
        
             | pixelpoet wrote:
             | No problem, and I highly recommend using OpenCL to get
             | started, it's actually pretty short and straightforward
             | (lookin' at you, Vulkan compute!). So basically, the stuff
             | in the for-loops you had before, that's the kernel you'll
             | be writing.
             | 
             | I daresay it's actually easier to render things this way
             | than getting lost in the weeds with API docs etc. A simple
             | Mandelbrot hello-world example should be easy to understand
             | and get compiling, and you can go from there.
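(The kernel mindset can be tried without any GPU API at all. A rough numpy sketch of such a Mandelbrot hello-world, with array ops standing in for the data-parallel kernel; the dimensions are arbitrary:)

```python
import numpy as np

# One array element per pixel; the old per-pixel double loop becomes
# operations applied to every pixel at once.
w, h, max_iter = 320, 200, 50

x = np.linspace(-2.0, 1.0, w)
y = np.linspace(-1.2, 1.2, h)
c = x[None, :] + 1j * y[:, None]            # one complex c per pixel

z = np.zeros_like(c)
escape = np.zeros(c.shape, dtype=np.int32)  # iteration count per pixel

for i in range(max_iter):
    mask = np.abs(z) <= 2.0           # pixels still iterating
    z[mask] = z[mask] ** 2 + c[mask]  # the "kernel body", every pixel at once
    escape[mask] = i

print(escape.shape)  # (h, w) iteration counts, ready to map to a palette
```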
        
           | dartos wrote:
            | Is Vulkan strictly a rasterization library?
           | 
           | I thought you could also do numeric calculations with compute
           | shaders.
        
             | pixelpoet wrote:
             | Yep you can, but there's still an astronomical amount of
             | crap you have to do that has nothing to do with pure
             | compute. OpenCL has none of these problems, and is IMO just
             | the perfect compute API.
        
         | smlacy wrote:
          | The thing you describe as "per-pixel optimizations" is
          | exactly what a fragment shader is. You don't need "a
          | polygon" per se; you can just run said code over every
          | pixel in the frame, for each frame. This is how simple
          | screenspace effects, compositing, HDR, etc. are
          | implemented.
        
           | Jupe wrote:
           | Interesting. While I've played with ShaderToy a bit, I've not
           | gotten such a thing to run locally. Maybe I should revisit
           | this..
           | 
            | I guess all I'd need would be an in-memory buffer of my
            | field, then a single function to copy the buffer (or a
            | section of the buffer, if I want fields larger than the
            | screen) to the video buffer.
        
       | bee_rider wrote:
       | Since I'm on my phone, I can only try the 2D js versions, but the
       | quadtree version does seem to provide some really obvious and
       | nice frame rate improvements. It seems like good work, although I
       | don't know much about this field.
       | 
       | Sidenote:
       | 
       | I'm often too self conscious to put anything technical online
       | with my real name on it, probably a bit of imposter syndrome or
       | something. I think based on
       | 
       | > For one of my programming classes this semester, my task was
       | writing a program in Python. It was only an introductory course,
       | so it didn't have to be anything fancy.
       | 
       | the author is sort of early on in their career. I often suspect
       | I'd be better off if I'd gotten over myself and thrown more early
       | experiments online. I dunno, maybe this stuff just isn't a big
       | deal for some people, but I have to applaud the bravery.
        
       | peter_l_downs wrote:
       | I was having trouble finding the link to the actual project, so
       | here you go:
       | 
       | - Github repo https://github.com/Caranea/gpulife
       | 
       | - Finished particle simulator https://webgpulife.netlify.app/
        
       | danielam wrote:
       | This reminds me of a paper[0] I co-wrote during a college
       | internship that involved simulating the evolution of a Brownian
       | system of particles with production and annihilation rules as the
       | evolution of probability density distributions for each particle
       | type within a partitioned cubic periodic box. Because we were
       | interested in how these particles segregated spatially,
       | visualization was a matter of finding the boundaries separating
       | each domain of particles. I made some fun little videos of the
       | evolution of the system over time (which tracks very
       | conspicuously the entropy production, plotted in the paper), but
       | unfortunately, they've since been taken down from the institute's
       | website where they were hosted. (The paper does contain some
       | snapshots that went into these visualizations, however.)
       | 
       | [0] https://pubs.aip.org/aip/jcp/article-
       | abstract/122/17/174105/...
        
       | Twirrim wrote:
        | The thing that strikes me as particularly weird about this
        | is the use of numpy there. In the other languages, they're
        | using native code. In Python they're reaching for numpy,
        | which is a great library, but not awesome inside a hot loop
        | unless you keep the operation you're carrying out within
        | numpy itself. This means that, right in that hot loop,
        | they're doing a lot of translating of numbers between the
        | Python representation and the native C representation.
       | 
       | Cutting numpy out by making the line "t = [0.0]*l" gets it down
       | to 17059 ms, without attempting any other optimisations. Using
       | that plus pypy (so you get the JIT that you have with
       | javascript/v8) gets it down to 958 ms.
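(The per-element boxing cost is visible even in a toy loop. A sketch, sizes arbitrary, timing the same scalar updates against a numpy array versus a plain list:)

```python
import time
import numpy as np

l = 200_000
t_np = np.zeros(l)
t_list = [0.0] * l

start = time.perf_counter()
for i in range(l):
    t_np[i] += 0.02 * i    # each scalar index crosses the Python/C boundary
np_ms = (time.perf_counter() - start) * 1000

start = time.perf_counter()
for i in range(l):
    t_list[i] += 0.02 * i  # stays entirely inside the interpreter
list_ms = (time.perf_counter() - start) * 1000

print(f"numpy scalar indexing: {np_ms:6.1f} ms")
print(f"plain Python list:     {list_ms:6.1f} ms")
```

(On CPython the list version typically wins by several fold; numpy pays for it with a boxed scalar object on every read and a type check on every write.)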
       | 
        | To show the cost of those translations while sticking with
        | numpy: if we switch that inner loop to
        | 
        |     temp_t_i = t[i]
        |     temp_t_i += 0.02 * j
        |     temp_t_i *= 0.03 * j
        |     temp_t_i -= 0.04 * j
        |     temp_t_i /= 0.05 * (j+1)
        |     t[i] = temp_t_i
       | 
       | It speeds us up from 47552 ms to 28251 ms. Almost half the
       | execution time. That's still doing two hops back and forth,
       | though. If you cut it down to a single line it's even faster at
       | 18458 ms, cutting execution time down to about a third of the
        | original example. Pypy isn't able to help here at all; this
        | is sort of a pathological case for it.
       | 
       | edit: I'll add, I'm not that good with numpy, rarely use it
       | myself. Not sure if it's possible to do that inner loop all
       | within numpy somehow. I imagine that'd be a lot faster still.
        
         | nerpderp82 wrote:
         | This is the problem that Numba was designed to solve.
         | 
         | https://numba.readthedocs.io/en/stable/user/5minguide.html
        
           | cl3misch wrote:
           | No. The problem in this post can be vectorized (== expressed
           | as array OPs) with idiomatic numpy, making it very fast (see
           | sibling comment).
           | 
           | Numba is designed to speed up code where looping is
           | unavoidable, i.e. the code can't be (easily) expressed as
           | array OPs.
        
             | GeompMankle wrote:
              | Negative again: a series of array operations like
              | this, each individually idiomatic numpy, will run very
              | fast in numba, as it can coalesce the ops into a
              | single pass through memory. Numpy can't do this and
              | has to pass through the array once per array
              | operation. There's nothing wrong with straight numpy,
              | but if you want compiled-C speed for the whole
              | ensemble of array ops, you need a JIT.
        
               | cl3misch wrote:
               | Ah, I was under the impression that using array OPs
               | inside a numba njit gave _worse_ performance. Has this
               | changed, or is my memory tricking me?
               | 
               | Do you have a source for this? I have not seen it in the
               | numba docs.
        
               | physicsguy wrote:
                | Numba replaces the ops with their LLVM IR
                | equivalents, so the compiler will optimise out and
                | coalesce the operations as part of the optimisation
                | stage.
                | 
                | If you want to look at the sort of things compilers
                | do, though, take a look at "common subexpression
                | elimination".
        
         | bscottmay wrote:
          | Yeah, an all-numpy version runs in less than 1ms on my M1
          | Air:
          | 
          |     import numpy as np
          |     l = 10_000
          |     t = np.empty(l, dtype=np.float32)
          |     j = np.arange(l)
          |     t = 0.02 * j
          |     t *= (0.03 * j)
          |     t -= (0.04 * j)
          |     t /= 0.05 * (j + 1)
        
           | Twirrim wrote:
           | So rather than spend 10-20 minutes reading about numpy, the
           | author wrote 3 other implementations...?
           | 
           | The fact that they ran the C code without the optimisation
           | flags and compared it that way makes me think Javascript was
           | what they actually wanted to write this one in anyway.
        
       | chris_va wrote:
       | Sounds like they are slowly learning electrostatic PIC codes from
       | first principles.
        
       ___________________________________________________________________
       (page generated 2024-03-18 23:00 UTC)