[HN Gopher] Optimizing the Particle Life: From 400 to 4M particles
___________________________________________________________________
Optimizing the Particle Life: From 400 to 4M particles
Author : caranea
Score : 115 points
Date : 2024-03-18 15:32 UTC (7 hours ago)
(HTM) web link (programmingattack.com)
(TXT) w3m dump (programmingattack.com)
| hyperbrainer wrote:
| Why was C slower though?
| KeplerBoy wrote:
| No optimization flags could be a big part of the reason.
|
| Haven't looked closely at the code or tried it, but with -O3,
| -fopenmp and a well-placed pragma the performance could
| increase many-fold.
|
| Heck, with NVC++ you could offload that thing to a GPU with
| minimal effort and have it flying at the memory bandwidth
| limit.
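|
| (For concreteness, a minimal sketch, not the author's code:
| the loop body below is assumed from the benchmark quoted
| downthread, and a single OpenMP pragma parallelizes the
| outer loop, since each t[i] is independent. Build with:
| gcc -O3 -fopenmp sketch.c)
|
|     #include <stdio.h>
|
|     #define L 10000
|     #define STEPS 10000
|
|     int main(void) {
|         static float t[L];
|         /* Iterations over i are independent: safe to parallelize. */
|         #pragma omp parallel for
|         for (int i = 0; i < L; i++) {
|             for (int j = 0; j < STEPS; j++) {
|                 t[i] += 0.02f * j;
|                 t[i] *= 0.03f * j;
|                 t[i] -= 0.04f * j;
|                 t[i] /= 0.05f * (j + 1);
|             }
|         }
|         /* Print something so the work can't be optimized away. */
|         printf("%f\n", t[L / 2]);
|         return 0;
|     }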
| wzdd wrote:
| Simply compiling it with -O3 produces something which completes
| in half the time of the JavaScript version (350ms for C, 750ms
| for JS), so perhaps that.
|
| Edit for Twirrim: on this system (Ryzen 7, gcc 11): "-O3":
| 350ms; "-O3 -march=native": 208ms; "-O2": 998ms; "-O2
| -march=native": 1040ms.
|
| Edit 2: Interestingly, changing the C from float to double
| produces a 3.5x speedup, taking the time elapsed (with "-O3
| -march=native") to 58ms, or about 12x faster than JS. This also
| makes what it's computing closer to the JavaScript version.
| anon946 wrote:
| Did you add output of the results to the C code?
| wzdd wrote:
| I did not, but I confirmed with objdump that my compiler
| was not removing the code.
|
| (But to be sure, I just ran it again with an output and got
| the same value.)
| Twirrim wrote:
| For fun and frolics:
|
| No flags: 1843ms
|
| -march=native: 2183 ms
|
| -O2: 423 ms
|
| -O2 -march=native: 250 ms
|
| -O3: 425 ms
|
| -O3 -march=native: 255 ms
|
| -O3 doesn't seem to be helping over -O2 in my case.
| Tadpole9181 wrote:
| They didn't use _any_ optimization flags.
| Twirrim wrote:
| Yeah, I was just trying to show the difference. Doing it
| without optimisation flags is an utterly bewildering
| decision by the author.
| jasonjmcghee wrote:
| Seems crazy to me that double would produce that kind of
| speed up. Is float getting emulated somehow? Don't they end
| up the same size?
| pvg wrote:
| Author mentions they didn't use optimization flags but doesn't
| include the compilation details. You can sort of guess that
| (relatively) unoptimized C might perform worse than V8's JIT on
| short, straight computational code - you're more or less
| testing two native code generators doing a simple thing except
| one has more optimizations enabled and wins.
| hyperbrainer wrote:
| Oh, I assumed he still did -O2, and did not do anything else.
| Is that bad to assume?
|
| PS: I do not use C beyond reading some of its code for
| inspiration, so kinda unaware
| pvg wrote:
| It just says 'no optimization flags' and that's that, I
| don't think even the compiler is mentioned so the author is
| not giving you a lot to go on here - you don't even know
| what the default is.
|
| A modern C optimizing compiler is going to go absolutely
| HAM on this sort of code snippet if you tell it to and do
| next to nothing if you don't - that's, roughly, the
| explanation for this little discrepancy.
| anon946 wrote:
| IMO, not using any optimization flags with C is somewhat
| arbitrary, since the compiler writers could have just decided
| that by default we'll do things X, Y, and Z, and then you'd
| need to turn them off explicitly.
|
| FWIW, without -O, with -O, and with -O4, I get 2500ms, 1500ms,
| and 550ms respectively. I didn't bother to look at the .S to see
| the code improvements. (Of course, I edited the code to output
| the results, otherwise, it just optimized out everything.)
| hyperbrainer wrote:
| > (Of course, I edited the code to output the results,
| otherwise, it just optimized out everything.)
|
| O(1) :)
| caranea wrote:
| Thanks for posting your results!
|
| Since I was already set on writing in-browser particle life, I
| didn't benchmark C code with different flags.
| anon946 wrote:
| Completely reasonable. I'd probably edit your blog post a bit
| to indicate that.
| TylerE wrote:
| Should also test -Os when doing this sort of thing. Sometimes
| the reduced size greatly improves cache behavior, and even when
| not it's often outright competitive with at least -O2 anyway
| (usually compiles faster too!)
| abainbridge wrote:
| One optimization for the C code is to put "f" suffixes on the
| floating point constants. For example, convert this line:
|
|     t[i] += 0.02 * (float)j;
|
| to:
|
|     t[i] += 0.02f * (float)j;
|
| I believe this helps because 0.02 is a double and doing double
| * float and then converting the result to float can produce a
| different answer to just doing float * float. The compiler has
| to do the slow version because that's what you asked for.
|
| Adding the -ffast-math switch appears to make no difference.
| I'm never sure what -ffast-math does exactly.
|
| Minimal case on Godbolt:
|
| https://godbolt.org/z/W18YsnMY5 - without the f
|
| https://godbolt.org/z/oc1s8WKeG - with the f
| a1369209993 wrote:
| > I believe this helps because 0.02 is a double and [...] can
| produce a different answer
|
| In principle, not quite. The real/unavoidable(-by-the-
| compiler) problem is that 0.02 is not a dyadic rational
| (not representable exactly as some integer over a power of
| two). So its representation (rounded to 52 bits) as a double
| is a different real number than its representation (rounded
| to 23 bits) as a float. (This is the same problem as rounding
| pi or e to a double/float, but people tend to forget that it
| applies to all _dyadic_ irrationals - numbers that are not
| dyadic rationals - not just regular irrationals.)
|
| If, instead of `0.02f`, you replaced `0.02` with
| `(double)0.02f` or `0.015625`, the optimization should in
| theory still apply (although missed-optimization compiler
| bugs are of course possible).
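|
| (A quick standalone C illustration of that point: 0.02
| rounds to different real numbers at float and double
| precision, while the dyadic rational 0.015625 = 1/64 is
| exact in both.)
|
|     #include <stdio.h>
|
|     int main(void) {
|         printf("%.20f\n", (double)0.02f);     /* 0.02 rounded to float, widened */
|         printf("%.20f\n", 0.02);              /* 0.02 rounded to double */
|         printf("%.20f\n", (double)0.015625f); /* exact as a float */
|         printf("%.20f\n", 0.015625);          /* exact as a double, same value */
|         return 0;
|     }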
| abainbridge wrote:
| OK, right, that's the clarity of thought I was missing.
|
| But in this case the compiler still misses the optimization
| with '(double)0.02f'. https://godbolt.org/z/az7819nKM
|
| I think this is because the optimization isn't safe. I
| wrote a program to find a counter example to your claim
| that "the optimization should in theory still apply". It
| found one. Here's the code:
|
|     #include <stdio.h>
|     #include <stdlib.h>
|
|     float mul_as_float(float t) {
|         t += 0.02f * (float)17;
|         return t;
|     }
|
|     float mul_as_double(float t) {
|         t += (double)0.02f * (float)17;
|         return t;
|     }
|
|     int main() {
|         while (1) {
|             unsigned r = rand();
|             float t = *((float*)&r);
|             float result1 = mul_as_float(t);
|             float result2 = mul_as_double(t);
|             if (result1 != result2) {
|                 printf("Counter example when t is %f (0x%x)\n",
|                        t, *((unsigned*)&t));
|                 printf("result1 is %f (0x%x)\n",
|                        result1, *((unsigned*)&result1));
|                 printf("result2 is %f (0x%x)\n",
|                        result2, *((unsigned*)&result2));
|                 return 0;
|             }
|         }
|     }
|
| It outputs:
|
|     Counter example when t is 0.000000 (0x3477d43f)
|     result1 is 0.340000 (0x3eae1483)
|     result2 is 0.340000 (0x3eae1482)
|
| What do you think?
| a1369209993 wrote:
| On my machine, the compiler constant-folds the
| multiplication, producing a single-precision add for
| `mul_as_float` and a convert-t-to-double, double-
| precision-add, convert-sum-to-single for `mul_as_double`.
| I missed the `+=` in your original comment, but adding a
| float to a double does implicitly promote it like that,
| so you'd actually need:
|
|     t += (float)((double)0.02f * (float)17);
|
| to achieve the "and then converting the result [of the
| multiplication] to float" (rather than keeping it a
| double for the addition) from your original comment.
| (With the above line in mul_as_double, your test code no
| longer finds a counterexample, at least when I ran it.)
|
| If you ask for higher-precision intermediates, even
| implicitly, floating-point compilers will typically give
| them to you, hoped-for efficiency of single-precision be
| damned.
| abainbridge wrote:
| Ah right, yep.
| CyberDildonics wrote:
| This is fairly trivial stuff with underwhelming results. 4
| million particles isn't much on modern hardware, even with
| spatial partitioning. The hardest part of spatial partitioning is
| probably the one dimensional partition algorithm in the first
| place, which can be found here:
|
| https://en.cppreference.com/w/cpp/algorithm/partition
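|
| (For the curious, a rough C sketch of that one-dimensional
| partition step, with hypothetical names; the std::partition
| linked above does the same thing generically: reorder the
| points so everything below a pivot coordinate comes first.)
|
|     #include <stdio.h>
|
|     typedef struct { float x, y; } Point;
|
|     /* Returns the index of the first point with x >= pivot. */
|     static int partition_by_x(Point *p, int n, float pivot) {
|         int lo = 0, hi = n;
|         while (lo < hi) {
|             if (p[lo].x < pivot) {
|                 lo++;              /* already on the left side */
|             } else {
|                 Point tmp = p[lo]; /* swap it to the right side */
|                 p[lo] = p[--hi];
|                 p[hi] = tmp;
|             }
|         }
|         return lo;
|     }
|
|     int main(void) {
|         Point pts[] = {{5,0},{1,0},{4,0},{2,0},{3,0}};
|         int n = (int)(sizeof pts / sizeof pts[0]);
|         int split = partition_by_x(pts, n, 3.0f);
|         for (int i = 0; i < n; i++)
|             printf("%.0f ", pts[i].x); /* prints: 2 1 4 3 5 */
|         printf("(split at index %d)\n", split);
|         return 0;
|     }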
| iainmerrick wrote:
| It's not rocket science, but "trivial" is very harsh. This
| stuff is fiddly to get right. The author (who I assume is a
| student) seems to have done a good job and it's a nice writeup.
| CyberDildonics wrote:
| I guess you could argue that, but it's strange that people
| want to give a student's first project (with a lot of huge
| mistakes like thinking node is going to outperform C because
| they didn't turn optimizations on) so much interest. I think
| people are taken in by big numbers and assume there is
| something cutting edge because they don't know any better.
|
| The results are just some particles in an extremely basic
| pattern, there isn't a lot of payoff.
| caranea wrote:
| One point needs correcting though - nowhere in the article
| do I say that node is outperforming C. On the contrary, I
| stressed that I didn't use any flags so that people don't
| come to the conclusion that C would always be slower, and I
| explicitly mentioned not to get fixated on such benchmarks.
|
| What it was meant to show is that V8 is *good enough* to
| even consider it for the job :)
| CyberDildonics wrote:
| You benchmarked three different programs and one was
| compiled to be slow. Why give times for a debug build
| when the other two aren't? Showing off performance
| metrics that are both apples to oranges while also giving
| times for something not made to run fast is total
| nonsense.
|
| Do you really not think this is a mistake? Most people
| would see C getting outperformed and realize there is
| something wrong. I'm shocked anyone would do this and then
| try to rationalize it after.
| atulvi wrote:
| 4M particles is not the question. What's the framerate?
| bee_rider wrote:
| You can go to the "next" story at the bottom of the post, and
| then the author has some links to an online implementation.
| jasonjmcghee wrote:
| Would vary significantly based on the GPU, I would assume.
| Jupe wrote:
| This is pretty neat...
|
| Having an interest in the space for many years (on-and-off; life
| gets busy sometimes) - I've often lamented the lack of good
| pixel-level performance optimizations in graphics cards.
| Everything seems hell-bent on polygons.
|
| Some years ago, DirectDraw on Windows was an excellent way to
| optimize the graphics portion of these types of things, but that
| all went away as "3D Is The Way" mentality took over.
|
| My explorations of Processing.* were neat, but lacked the IDE and
| library support I like as a developer.
| pixelpoet wrote:
| There's no lack of "pixel-level performance optimisations" on
| GPUs, and just because basically the whole world seems to think
| that graphics programming == using some rasterisation library
| like OpenGL or DX or Vulkan or whatever for realtime
| applications, doesn't mean you're forced to use it or that
| that's all there is.
|
| Just fire up OpenCL (or CUDA, if you must) and start executing
| some data-parallel kernels.
| Jupe wrote:
| That's all a little beyond me at this point. But very
| compelling!
|
| As the original simulation shows, the idea is to calculate
| the difference and proximity of the pixels. So the drawing
| part is just a projection...
|
| Getting a memory buffer pointer and setting pixels within my
| calculation loop with simple offsets was very simple and
| compelling. I think I should look deeper into
| OpenCL/CUDA for this use-case, as I mentioned in the other
| response to my comment.
|
| Thanks!
| pixelpoet wrote:
| No problem, and I highly recommend using OpenCL to get
| started, it's actually pretty short and straightforward
| (lookin' at you, Vulkan compute!). So basically, the stuff
| in the for-loops you had before, that's the kernel you'll
| be writing.
|
| I daresay it's actually easier to render things this way
| than getting lost in the weeds with API docs etc. A simple
| Mandelbrot hello-world example should be easy to understand
| and get compiling, and you can go from there.
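|
| (To give a feel for it, a minimal sketch using the OpenCL C
| API; error checks are omitted and the kernel body is just a
| stand-in for the per-particle work. Build with:
| gcc sketch.c -lOpenCL)
|
|     #define CL_TARGET_OPENCL_VERSION 300
|     #include <stdio.h>
|     #include <CL/cl.h>
|
|     /* The "for-loop body" lives here, as a data-parallel kernel. */
|     static const char *src =
|         "__kernel void step(__global float *t) {\n"
|         "    int i = get_global_id(0);\n"
|         "    t[i] += 0.02f * i;\n"
|         "}\n";
|
|     int main(void) {
|         enum { N = 1 << 20 };
|         static float t[N];
|
|         cl_platform_id plat; cl_device_id dev;
|         clGetPlatformIDs(1, &plat, NULL);
|         clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
|         cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
|         cl_command_queue q =
|             clCreateCommandQueueWithProperties(ctx, dev, NULL, NULL);
|
|         /* Compile the kernel source at runtime. */
|         cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
|         clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
|         cl_kernel k = clCreateKernel(prog, "step", NULL);
|
|         /* Upload data, run one work-item per element, read back. */
|         cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE |
|                                     CL_MEM_COPY_HOST_PTR, sizeof t, t, NULL);
|         clSetKernelArg(k, 0, sizeof buf, &buf);
|         size_t global = N;
|         clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
|         clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof t, t, 0, NULL, NULL);
|
|         printf("t[123] = %f\n", t[123]);
|         return 0;
|     }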
| dartos wrote:
| Is vulkan strictly a rasterization library?
|
| I thought you could also do numeric calculations with compute
| shaders.
| pixelpoet wrote:
| Yep you can, but there's still an astronomical amount of
| crap you have to do that has nothing to do with pure
| compute. OpenCL has none of these problems, and is IMO just
| the perfect compute API.
| smlacy wrote:
| The thing you describe as "per-pixel optimizations" is exactly
| what a fragment shader is. You don't need "a polygon" per se;
| you can just run said code over every pixel in the frame, for
| each frame. This is how simple screenspace effects,
| compositing, HDR, etc. are implemented.
| Jupe wrote:
| Interesting. While I've played with ShaderToy a bit, I've not
| gotten such a thing to run locally. Maybe I should revisit
| this...
|
| I guess all I'd need would be an in-memory buffer of my
| field, then a single function to copy the buffer (or a
| section of the buffer, if I want larger-than-screen fields)
| to the video buffer.
| bee_rider wrote:
| Since I'm on my phone, I can only try the 2D js versions, but the
| quadtree version does seem to provide some really obvious and
| nice frame rate improvements. It seems like good work, although I
| don't know much about this field.
|
| Sidenote:
|
| I'm often too self conscious to put anything technical online
| with my real name on it, probably a bit of imposter syndrome or
| something. I think based on
|
| > For one of my programming classes this semester, my task was
| writing a program in Python. It was only an introductory course,
| so it didn't have to be anything fancy.
|
| the author is sort of early on in their career. I often suspect
| I'd be better off if I'd gotten over myself and thrown more early
| experiments online. I dunno, maybe this stuff just isn't a big
| deal for some people, but I have to applaud the bravery.
| peter_l_downs wrote:
| I was having trouble finding the link to the actual project, so
| here you go:
|
| - Github repo https://github.com/Caranea/gpulife
|
| - Finished particle simulator https://webgpulife.netlify.app/
| danielam wrote:
| This reminds me of a paper[0] I co-wrote during a college
| internship that involved simulating the evolution of a Brownian
| system of particles with production and annihilation rules as the
| evolution of probability density distributions for each particle
| type within a partitioned cubic periodic box. Because we were
| interested in how these particles segregated spatially,
| visualization was a matter of finding the boundaries separating
| each domain of particles. I made some fun little videos of the
| evolution of the system over time (which tracks very
| conspicuously the entropy production, plotted in the paper), but
| unfortunately, they've since been taken down from the institute's
| website where they were hosted. (The paper does contain some
| snapshots that went into these visualizations, however.)
|
| [0] https://pubs.aip.org/aip/jcp/article-
| abstract/122/17/174105/...
| Twirrim wrote:
| The thing that strikes me as particularly weird about this is the
| use of numpy there. In other languages, they're using native
| code. In Python they're reaching out to numpy, which is a great
| library, but not awesome inside a hot loop unless you keep the
| operation you're carrying out within numpy itself. This means,
| right in that hot loop, they're doing a lot of translating of
| numbers between Python representation and native C
| representation.
|
| Cutting numpy out by making the line "t = [0.0]*l" gets it down
| to 17059 ms, without attempting any other optimisations. Using
| that plus pypy (so you get the JIT that you have with
| javascript/v8) gets it down to 958 ms.
|
| To show the cost of those translations while sticking with
| numpy: if we switch that inner loop to
|
|     temp_t_i = t[i]
|     temp_t_i += 0.02 * j
|     temp_t_i *= 0.03 * j
|     temp_t_i -= 0.04 * j
|     temp_t_i /= 0.05 * (j + 1)
|     t[i] = temp_t_i
|
| It speeds us up from 47552 ms to 28251 ms. Almost half the
| execution time. That's still doing two hops back and forth,
| though. If you cut it down to a single line it's even faster at
| 18458 ms, cutting execution time down to about a third of the
| original example. Pypy isn't able to help here at all, this is
| sort of a pathological case for it.
|
| edit: I'll add, I'm not that good with numpy, rarely use it
| myself. Not sure if it's possible to do that inner loop all
| within numpy somehow. I imagine that'd be a lot faster still.
| nerpderp82 wrote:
| This is the problem that Numba was designed to solve.
|
| https://numba.readthedocs.io/en/stable/user/5minguide.html
| cl3misch wrote:
| No. The problem in this post can be vectorized (== expressed
| as array OPs) with idiomatic numpy, making it very fast (see
| sibling comment).
|
| Numba is designed to speed up code where looping is
| unavoidable, i.e. the code can't be (easily) expressed as
| array OPs.
| GeompMankle wrote:
| Negative again: a series of array operations which are
| individually idiomatic numpy, like this, will run very, very
| fast in numba, as it can coalesce the ops into a single pass
| through memory. Numpy can't do this and has to pass through
| the array for each individual array operation. There's
| nothing wrong with straight numpy, but if you want it
| compiled-C fast for the whole ensemble of array ops, you
| need a JIT.
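|
| (Sketched in C for concreteness: chained numpy ops behave
| like the "separate passes" version below, while a JIT such
| as numba can fuse them into the "single pass" version.)
|
|     #include <stdio.h>
|
|     enum { N = 8 };
|
|     /* One traversal of memory per array op (numpy-style). */
|     void separate_passes(float *t, const float *j, size_t n) {
|         for (size_t i = 0; i < n; i++) t[i] += 0.02f * j[i];
|         for (size_t i = 0; i < n; i++) t[i] *= 0.03f * j[i];
|         for (size_t i = 0; i < n; i++) t[i] -= 0.04f * j[i];
|     }
|
|     /* All ops fused: each element is loaded and stored once. */
|     void single_pass(float *t, const float *j, size_t n) {
|         for (size_t i = 0; i < n; i++) {
|             float v = t[i];
|             v += 0.02f * j[i];
|             v *= 0.03f * j[i];
|             v -= 0.04f * j[i];
|             t[i] = v;
|         }
|     }
|
|     int main(void) {
|         float t1[N], t2[N], j[N];
|         for (int i = 0; i < N; i++) {
|             t1[i] = t2[i] = 1.0f;
|             j[i] = (float)i;
|         }
|         separate_passes(t1, j, N);
|         single_pass(t2, j, N);
|         printf("%f %f\n", t1[3], t2[3]); /* identical results */
|         return 0;
|     }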
| cl3misch wrote:
| Ah, I was under the impression that using array OPs
| inside a numba njit gave _worse_ performance. Has this
| changed, or is my memory tricking me?
|
| Do you have a source for this? I have not seen it in the
| numba docs.
| physicsguy wrote:
| Numba replaces the ops with their LLVM bitcode
| equivalents, so the compiler can optimise and coalesce the
| operations as part of its optimisation stage.
|
| If you want to look at the sort of things compilers do,
| take a look at "common subexpression elimination".
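|
| (A tiny example of what common subexpression elimination
| does: at -O2 a compiler computes the repeated 0.05f * j
| below once and reuses it.)
|
|     #include <stdio.h>
|
|     float f(float a, float b, int j) {
|         float x = a / (0.05f * j); /* 0.05f * j computed once...   */
|         float y = b / (0.05f * j); /* ...and reused here after CSE */
|         return x + y;
|     }
|
|     int main(void) {
|         printf("%f\n", f(1.0f, 2.0f, 4));
|         return 0;
|     }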
| bscottmay wrote:
| Yeah, an all-numpy version runs in less than 1ms on my M1 Air:
|
|     import numpy as np
|     l = 10_000
|     t = np.empty(l, dtype=np.float32)
|     j = np.arange(l)
|     t = 0.02 * j
|     t *= 0.03 * j
|     t -= 0.04 * j
|     t /= 0.05 * (j + 1)
| Twirrim wrote:
| So rather than spend 10-20 minutes reading about numpy, the
| author wrote 3 other implementations...?
|
| The fact that they ran the C code without the optimisation
| flags and compared it that way makes me think Javascript was
| what they actually wanted to write this one in anyway.
| chris_va wrote:
| Sounds like they are slowly learning electrostatic PIC codes from
| first principles.
___________________________________________________________________
(page generated 2024-03-18 23:00 UTC)