Post B1PR3do0G2TTKOwwD2 by amonakov@mastodon.gamedev.place
 (DIR) More posts by amonakov@mastodon.gamedev.place
 (DIR) Post #B1ODUJ0M6qgc7NaVhg by wolf480pl@mstdn.io
       2025-12-19T00:29:19Z
       
       0 likes, 0 repeats
       
       @algernon how do you compile the rust version?I'm guessing you need to use cargo to get the fastrand library?
       
 (DIR) Post #B1PQfyV9eujQZ7qfey by wolf480pl@mstdn.io
       2025-12-19T12:07:09Z
       
       0 likes, 0 repeats
       
       @algernon I have thingsOG print10.c: 556msprint10-gr.c:#include <stdio.h>  #include <stdlib.h>  #include <time.h>  #include <unistd.h>  #include <sys/random.h>    #define ITERATIONS 100000000    int main() {      static char maze[ITERATIONS];        getrandom(maze, sizeof(maze), 0);        for (long i = 0; i < ITERATIONS; i++) {          maze[i] = maze[i] % 2 ? '/' : '\\';      }      write(1, maze, ITERATIONS);        return 0;  }190ms1/
       
 (DIR) Post #B1PQfzf7LMIMAJW8pM by wolf480pl@mstdn.io
       2025-12-19T12:09:13Z
       
       0 likes, 0 repeats
       
       @algernon getrandom is faster than glibc's rand(), but it's still quite slow (it's a CSPRNG).Rust's fastrand uses wyhash, soprint10-wh.c:int main() {    static char maze[ITERATIONS];    uint64_t seed = time(NULL);    for (long i = 0; i < ITERATIONS; i++) {        maze[i] = wyrand(&seed) % 2 ? '/' : '\\';    }    write(1, maze, ITERATIONS);    return 0;}125ms2/
       
 (DIR) Post #B1PQg0H34KdW3xGQwy by wolf480pl@mstdn.io
       2025-12-19T12:11:50Z
       
       0 likes, 0 repeats
       
       @algernon but that loop didn't get vectorized, so:print10-wh-buf.c:  int main() {      static char maze[ITERATIONS];        uint64_t seed = time(NULL);      for (uint64_t *ptr = (uint64_t*) maze; ptr < (uint64_t*)(&maze[ITERATIONS]); ++ptr) {          *ptr = wyrand(&seed);      }        for (long i = 0; i < ITERATIONS; i++) {          maze[i] = maze[i] % 2 ? '/' : '\\';      }      write(1, maze, ITERATIONS);        return 0;  }50ms3/
       
 (DIR) Post #B1PQg1524UcQYyeLrM by wolf480pl@mstdn.io
       2025-12-19T12:13:35Z
       
       0 likes, 0 repeats
       
       @algernon finally:print10-wh-fan.c:int main() {    static char maze[ITERATIONS];    uint64_t seed = time(NULL);    union {      uint64_t u;      char b[8];    } rand;    for (long i = 0; i < ITERATIONS; i+=8) {        rand.u = wyrand(&seed);        for (long j = 0; j < 8; j++) {          maze[i+j] = rand.b[j] % 2 ? '/' : '\\';        }    }    write(1, maze, ITERATIONS);    return 0;}35ms
       
 (DIR) Post #B1PQg1qBFCKgvCi0Lg by wolf480pl@mstdn.io
       2025-12-19T14:31:46Z
       
       0 likes, 0 repeats
       
       @algernon Hmm looks like a large part of the saving when going from print10-wh to print10-wh-buf is from calling wyrand 8x less often.print10-wh-buf.c compiled with -fno-tree-vectorize takes 90ms to run
       
 (DIR) Post #B1PR3do0G2TTKOwwD2 by amonakov@mastodon.gamedev.place
       2025-12-19T14:36:03Z
       
       0 likes, 0 repeats
       
       @wolf480pl @algernon at this point page faults on the giant maze array start to dominate, so you can get another 2x gain by outputting in a reasonable-size chunks
       
 (DIR) Post #B1PRTBTCEP12yZOggy by wolf480pl@mstdn.io
       2025-12-19T14:40:43Z
       
       0 likes, 0 repeats
       
       @amonakovWell this one was 20ms sys, which could be pagefaults... gonna try smaller chunks later.@algernon
       
 (DIR) Post #B1PWBGjIR8wfqZCkRk by wolf480pl@not.acu.lt
       2025-12-19T15:32:44.336Z
       
       1 likes, 0 repeats
       
       @wolf480pl@mstdn.io @amonakov@mastodon.gamedev.place @algernon@come-from.mad-scientist.club yupprint10-wh-fan-chunked.c:#define ITERATIONS 100000000#define CHUNK (64 * 4096)int main() {  static char buf[CHUNK];  uint64_t seed = time(NULL);  union {    uint64_t u;    char b[8];  } rand;  for (long i = 0; i < ITERATIONS; i+=CHUNK) {    if (i) {      write(1, buf, sizeof(buf));    }    for (long k = 0; k < CHUNK; k += 8) {      rand.u = wyrand(&seed);      for (long j = 0; j < 8; j++) {        buf[k+j] = rand.b[j] % 2 ? '/' : '\\';      }    }  }  write(1, buf, ITERATIONS % CHUNK);  return 0;}~16ms most of the time, 24ms if unlucky
       
 (DIR) Post #B1PWBHqQI8ExIxXxC4 by wolf480pl@mstdn.io
       2025-12-19T15:33:25Z
       
       0 likes, 0 repeats
       
       (posted from @wolf480pl@not.acu.lt  because that one has a larger character limit)@amonakov @algernon
       
 (DIR) Post #B1Ph6yCqSnLcjdB8QS by amonakov@mastodon.gamedev.place
       2025-12-19T17:35:58Z
       
       0 likes, 0 repeats
       
       @wolf480pl @algernon at this point you are spending 99% in the inner loop, so you can switch from ms to cycles per byte and cross-check what you're gettingperf stat ./a.out 64,732,836      cycles163,468,563      instructionsthat's 13 instructions (can count 'em in asm) and 5 cpu cycles (on my SNB) per 8 bytes
       
 (DIR) Post #B1PjMs3BJnYaBVMtcG by amonakov@mastodon.gamedev.place
       2025-12-19T17:38:05Z
       
       0 likes, 0 repeats
       
       @wolf480pl @algernon and you can extract the loop from the asm, assemble it to an object file, feed it to uiCA, and witness it estimate the 5 cycles figure correctly:uiCA.py -arch all loop.oThroughput (in cycles per iteration): 3.22 - 5.00    - 3.22 on RKL    - 3.49 on ICL, TGL    - 3.52 on CFL, CLX, KBL, SKL, SKX    - 3.57 on HSW    - 4.00 on BDW    - 4.58 on IVB    - 5.00 on SNB
       
 (DIR) Post #B1PjMsp2Rrq0Zvl7D6 by wolf480pl@mstdn.io
       2025-12-19T18:01:15Z
       
       0 likes, 0 repeats
       
       @amonakov @algernon I'm getting225 442 910 instructions:u # 5,09  insn per cycle #    0,00  stalled cycles per insn    44 249 926 cycles:u #    1,435 GHzpython>>> 225442910 / (10**8 / 8)18.0354328>>> 44249926 / (10**8 / 8)3.5399940818 insn, which is what mine compiled to (gcc 15.2.1 -g -O3)3.5 cycles per 8 bytes, on Ryzen Renoirdon't have uiCA installed, and the web versions seems to be down :/it doesn't seem to support Zen either way :/
       
 (DIR) Post #B1PkBb5k87JGMBkqkC by amonakov@mastodon.gamedev.place
       2025-12-19T18:10:24Z
       
       0 likes, 0 repeats
       
       @wolf480pl @algernon right, sorry, 13 insns was with -march=sandybridge (otherwise 16 with gcc-13 -O2)
       
 (DIR) Post #B1PlJGgxxWMJMLO3tI by wolf480pl@mstdn.io
       2025-12-19T18:23:01Z
       
       0 likes, 0 repeats
       
       @amonakov @algernon huh...looks like gcc 15:- allocates registers a bit differently, requiring a mov before a mul (shouldn't affect performance, right?)- does pcmpeqb twice for some reasonhttps://godbolt.org/z/TEra4f81sI wonder what -march=znver2 will do
       
 (DIR) Post #B1Pltm1Qx4Mkx9MrTc by wolf480pl@mstdn.io
       2025-12-19T18:29:37Z
       
       0 likes, 0 repeats
       
       @amonakov @algernon gcc-15 -g -O3 -march=znver214 insn2.93 cycles per 8 bytesstill a duplicate [v]pcmpeqb
       
 (DIR) Post #B1Plz1yB5RStrSLrf6 by wolf480pl@mstdn.io
       2025-12-19T18:30:34Z
       
       0 likes, 0 repeats
       
       @amonakov @algernon This is fun!This feels like Zachtronics or TuringComplete except it's real
       
 (DIR) Post #B1Pm3YVA4kDhupw9wm by harold@mastodon.gamedev.place
       2025-12-19T18:31:20Z
       
       0 likes, 0 repeats
       
       @wolf480pl @amonakov @algernon that's a way to negate the mask, which isn't necessary because the blend could be done the other way around
       
 (DIR) Post #B1PmCOwkyBuCt73w2a by wolf480pl@mstdn.io
       2025-12-19T18:32:59Z
       
       0 likes, 0 repeats
       
       @harold @amonakov @algernon riiight, makes sense now.It felt like !!x to turn it into boolean but I didn't realize you can just swap and / andn (or blend args)
       
 (DIR) Post #B1Pqw2dmdQPoqQCr5s by amonakov@mastodon.gamedev.place
       2025-12-19T19:26:01Z
       
       0 likes, 0 repeats
       
       @wolf480pl @algernon > shouldn't affect performance, right?the mov becomes "free" after renaming, but the loop is running at 5 IPC, which is close to pipeline throughput limits before the renamer; so it probably has a bit of a negative effect
       
 (DIR) Post #B1Pr7reVJbWmEvBHU0 by wolf480pl@mstdn.io
       2025-12-19T19:28:12Z
       
       0 likes, 0 repeats
       
       @amonakov @algernon oh, so it could be decode-limited?
       
 (DIR) Post #B1PrHgZFNoBCkyPBL6 by amonakov@mastodon.gamedev.place
       2025-12-19T19:29:56Z
       
       0 likes, 0 repeats
       
       @wolf480pl @algernon it's going to be running out of opcache, so dispatch, not decode: https://en.wikichip.org/wiki/amd/microarchitectures/zen_2#Individual_Core