Post B1PR3do0G2TTKOwwD2 by amonakov@mastodon.gamedev.place
(DIR) More posts by amonakov@mastodon.gamedev.place
(DIR) Post #B1ODUJ0M6qgc7NaVhg by wolf480pl@mstdn.io
2025-12-19T00:29:19Z
0 likes, 0 repeats
@algernon how do you compile the rust version?I'm guessing you need to use cargo to get the fastrand library?
(DIR) Post #B1PQfyV9eujQZ7qfey by wolf480pl@mstdn.io
2025-12-19T12:07:09Z
0 likes, 0 repeats
@algernon I have thingsOG print10.c: 556msprint10-gr.c:#include <stdio.h> #include <stdlib.h> #include <time.h> #include <unistd.h> #include <sys/random.h> #define ITERATIONS 100000000 int main() { static char maze[ITERATIONS]; getrandom(maze, sizeof(maze), 0); for (long i = 0; i < ITERATIONS; i++) { maze[i] = maze[i] % 2 ? '/' : '\\'; } write(1, maze, ITERATIONS); return 0; }190ms1/
(DIR) Post #B1PQfzf7LMIMAJW8pM by wolf480pl@mstdn.io
2025-12-19T12:09:13Z
0 likes, 0 repeats
@algernon getrandom is faster than glibc's rand(), but it's still quite slow (it's a CSPRNG).Rust's fastrand uses wyhash, soprint10-wh.c:int main() { static char maze[ITERATIONS]; uint64_t seed = time(NULL); for (long i = 0; i < ITERATIONS; i++) { maze[i] = wyrand(&seed) % 2 ? '/' : '\\'; } write(1, maze, ITERATIONS); return 0;}125ms2/
(DIR) Post #B1PQg0H34KdW3xGQwy by wolf480pl@mstdn.io
2025-12-19T12:11:50Z
0 likes, 0 repeats
@algernon but that loop didn't get vectorized, so:print10-wh-buf.c: int main() { static char maze[ITERATIONS]; uint64_t seed = time(NULL); for (uint64_t *ptr = (uint64_t*) maze; ptr < (uint64_t*)(&maze[ITERATIONS]); ++ptr) { *ptr = wyrand(&seed); } for (long i = 0; i < ITERATIONS; i++) { maze[i] = maze[i] % 2 ? '/' : '\\'; } write(1, maze, ITERATIONS); return 0; }50ms3/
(DIR) Post #B1PQg1524UcQYyeLrM by wolf480pl@mstdn.io
2025-12-19T12:13:35Z
0 likes, 0 repeats
@algernon finally:print10-wh-fan.c:int main() { static char maze[ITERATIONS]; uint64_t seed = time(NULL); union { uint64_t u; char b[8]; } rand; for (long i = 0; i < ITERATIONS; i+=8) { rand.u = wyrand(&seed); for (long j = 0; j < 8; j++) { maze[i+j] = rand.b[j] % 2 ? '/' : '\\'; } } write(1, maze, ITERATIONS); return 0;}35ms
(DIR) Post #B1PQg1qBFCKgvCi0Lg by wolf480pl@mstdn.io
2025-12-19T14:31:46Z
0 likes, 0 repeats
@algernon Hmm looks like a large part of the saving when going from print10-wh to print10-wh-buf is from calling wyrand 8x less often.print10-wh-buf.c compiled with -fno-tree-vectorize takes 90ms to run
(DIR) Post #B1PR3do0G2TTKOwwD2 by amonakov@mastodon.gamedev.place
2025-12-19T14:36:03Z
0 likes, 0 repeats
@wolf480pl @algernon at this point page faults on the giant maze array start to dominate, so you can get another 2x gain by outputting in a reasonable-size chunks
(DIR) Post #B1PRTBTCEP12yZOggy by wolf480pl@mstdn.io
2025-12-19T14:40:43Z
0 likes, 0 repeats
@amonakovWell this one was 20ms sys, which could be pagefaults... gonna try smaller chunks later.@algernon
(DIR) Post #B1PWBGjIR8wfqZCkRk by wolf480pl@not.acu.lt
2025-12-19T15:32:44.336Z
1 likes, 0 repeats
@wolf480pl@mstdn.io @amonakov@mastodon.gamedev.place @algernon@come-from.mad-scientist.club yupprint10-wh-fan-chunked.c:#define ITERATIONS 100000000#define CHUNK (64 * 4096)int main() { static char buf[CHUNK]; uint64_t seed = time(NULL); union { uint64_t u; char b[8]; } rand; for (long i = 0; i < ITERATIONS; i+=CHUNK) { if (i) { write(1, buf, sizeof(buf)); } for (long k = 0; k < CHUNK; k += 8) { rand.u = wyrand(&seed); for (long j = 0; j < 8; j++) { buf[k+j] = rand.b[j] % 2 ? '/' : '\\'; } } } write(1, buf, ITERATIONS % CHUNK); return 0;}~16ms most of the time, 24ms if unlucky
(DIR) Post #B1PWBHqQI8ExIxXxC4 by wolf480pl@mstdn.io
2025-12-19T15:33:25Z
0 likes, 0 repeats
(posted from @wolf480pl@not.acu.lt because that one has a larger character limit)@amonakov @algernon
(DIR) Post #B1Ph6yCqSnLcjdB8QS by amonakov@mastodon.gamedev.place
2025-12-19T17:35:58Z
0 likes, 0 repeats
@wolf480pl @algernon at this point you are spending 99% in the inner loop, so you can switch from ms to cycles per byte and cross-check what you're gettingperf stat ./a.out 64,732,836 cycles163,468,563 instructionsthat's 13 instructions (can count 'em in asm) and 5 cpu cycles (on my SNB) per 8 bytes
(DIR) Post #B1PjMs3BJnYaBVMtcG by amonakov@mastodon.gamedev.place
2025-12-19T17:38:05Z
0 likes, 0 repeats
@wolf480pl @algernon and you can extract the loop from the asm, assemble it to an object file, feed it to uiCA, and witness it estimate the 5 cycles figure correctly:uiCA.py -arch all loop.oThroughput (in cycles per iteration): 3.22 - 5.00 - 3.22 on RKL - 3.49 on ICL, TGL - 3.52 on CFL, CLX, KBL, SKL, SKX - 3.57 on HSW - 4.00 on BDW - 4.58 on IVB - 5.00 on SNB
(DIR) Post #B1PjMsp2Rrq0Zvl7D6 by wolf480pl@mstdn.io
2025-12-19T18:01:15Z
0 likes, 0 repeats
@amonakov @algernon I'm getting225 442 910 instructions:u # 5,09 insn per cycle # 0,00 stalled cycles per insn 44 249 926 cycles:u # 1,435 GHzpython>>> 225442910 / (10**8 / 8)18.0354328>>> 44249926 / (10**8 / 8)3.5399940818 insn, which is what mine compiled to (gcc 15.2.1 -g -O3)3.5 cycles per 8 bytes, on Ryzen Renoirdon't have uiCA installed, and the web versions seems to be down :/it doesn't seem to support Zen either way :/
(DIR) Post #B1PkBb5k87JGMBkqkC by amonakov@mastodon.gamedev.place
2025-12-19T18:10:24Z
0 likes, 0 repeats
@wolf480pl @algernon right, sorry, 13 insns was with -march=sandybridge (otherwise 16 with gcc-13 -O2)
(DIR) Post #B1PlJGgxxWMJMLO3tI by wolf480pl@mstdn.io
2025-12-19T18:23:01Z
0 likes, 0 repeats
@amonakov @algernon huh...looks like gcc 15:- allocates registers a bit differently, requiring a mov before a mul (shouldn't affect performance, right?)- does pcmpeqb twice for some reasonhttps://godbolt.org/z/TEra4f81sI wonder what -march=znver2 will do
(DIR) Post #B1Pltm1Qx4Mkx9MrTc by wolf480pl@mstdn.io
2025-12-19T18:29:37Z
0 likes, 0 repeats
@amonakov @algernon gcc-15 -g -O3 -march=znver214 insn2.93 cycles per 8 bytesstill a duplicate [v]pcmpeqb
(DIR) Post #B1Plz1yB5RStrSLrf6 by wolf480pl@mstdn.io
2025-12-19T18:30:34Z
0 likes, 0 repeats
@amonakov @algernon This is fun!This feels like Zachtronics or TuringComplete except it's real
(DIR) Post #B1Pm3YVA4kDhupw9wm by harold@mastodon.gamedev.place
2025-12-19T18:31:20Z
0 likes, 0 repeats
@wolf480pl @amonakov @algernon that's a way to negate the mask, which isn't necessary because the blend could be done the other way around
(DIR) Post #B1PmCOwkyBuCt73w2a by wolf480pl@mstdn.io
2025-12-19T18:32:59Z
0 likes, 0 repeats
@harold @amonakov @algernon riiight, makes sense now.It felt like !!x to turn it into boolean but I didn't realize you can just swap and / andn (or blend args)
(DIR) Post #B1Pqw2dmdQPoqQCr5s by amonakov@mastodon.gamedev.place
2025-12-19T19:26:01Z
0 likes, 0 repeats
@wolf480pl @algernon > shouldn't affect performance, right?the mov becomes "free" after renaming, but the loop is running at 5 IPC, which is close to pipeline throughput limits before the renamer; so it probably has a bit of a negative effect
(DIR) Post #B1Pr7reVJbWmEvBHU0 by wolf480pl@mstdn.io
2025-12-19T19:28:12Z
0 likes, 0 repeats
@amonakov @algernon oh, so it could be decode-limited?
(DIR) Post #B1PrHgZFNoBCkyPBL6 by amonakov@mastodon.gamedev.place
2025-12-19T19:29:56Z
0 likes, 0 repeats
@wolf480pl @algernon it's going to be running out of opcache, so dispatch, not decode: https://en.wikichip.org/wiki/amd/microarchitectures/zen_2#Individual_Core