[HN Gopher] Six times faster than C
___________________________________________________________________
Six times faster than C
Author : 414owen
Score : 115 points
Date : 2023-07-06 16:20 UTC (6 hours ago)
(HTM) web link (owen.cafe)
(TXT) w3m dump (owen.cafe)
| gavinray wrote:
| Fantastic post, I appreciated that the ASM was displayed in tabs
| as both "standard" and "visual-arrows"-annotated.
|
| Kept me reading into the follow-up article.
|
| Also, I love the UI of this blog.
| RobotToaster wrote:
| Was the C compiled with optimisation enabled?
| 414owen wrote:
| Yes, I explained in the `Benchmarking setup` section that I
| used `-march=native`, but I guess I forgot to mention that I
| also used `-O3`.
| sltkr wrote:
| How much faster is this:
|
|     int run_switches(const char *buf) {
|         size_t len = strlen(buf);
|         int res = 0;
|         for (size_t i = 0; i < len; ++i) {
|             res += (buf[i] == 's') - (buf[i] == 'p');
|         }
|         return res;
|     }
|
| strlen() should be implemented in a pretty fast way, and after
| the buffer size is known, the compiler can autovectorize the
| inner loop, which does happen in practice:
| https://gcc.godbolt.org/z/qYfadPYoq
| xoranth wrote:
| I think I managed to improve on both this post and its sequel,
| at the cost of specializing the function for the case of a
| string made only of 's' and 'p'.
|
| The benchmark only tests strings made of 's' and 'p', so I think
| it is fair.
|
| The idea is as follows. We want to increase `res` by one when
| the next character is `s`. Naively, we might try something
| like this:
|
|     res += (c - 'r'); // is `res += 1` when c == 's'
|
| This doesn't work, as `'p' - 'r' == -2`, and we'd need it to be
| -1.
|
| But `'p' - 'r'`, when viewed as an unsigned subtraction, wraps
| around, setting the carry flag. It turns out x64 has an
| instruction (`adc`) that adds two registers _plus_ the carry
| flag.
|
| Therefore we can replace two `cmp, cmov` with one `sub, adc`:
|
|     run_switches:
|         xor eax, eax                  # res = 0
|     loop:
|         movsx ecx, byte ptr [rdi]
|         test ecx, ecx
|         je ret
|         inc rdi
|         sub ecx, 'r'
|         adc eax, ecx                  # Magic happens here
|         jmp loop
|     ret:
|         ret
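|
| The arithmetic can be sanity-checked in plain C (a minimal
| model of the sub/adc pair, not the benchmarked code):
|
|     #include <assert.h>
|     #include <stdint.h>
|
|     // Models `sub ecx, 'r'` followed by `adc eax, ecx`: 'p' - 'r'
|     // borrows (sets the carry flag), so adc adds the missing 1.
|     static int step(uint32_t c, int res) {
|         uint32_t d = c - 'r';      // 's' -> 1, 'p' -> 0xFFFFFFFE
|         uint32_t carry = c < 'r';  // x86 carry flag after the sub
|         return res + (int32_t)d + (int32_t)carry;
|     }
|
|     int main(void) {
|         assert(step('s', 0) == 1);   // +1 for 's'
|         assert(step('p', 0) == -1);  // -2 + carry(1) = -1 for 'p'
|         return 0;
|     }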
|
| Benchmarks are as follows (`bench-x64-8` is the asm above):
|
|     Summary
|       '01-six-times-faster-than-c/bench-x64-8 1000 1' ran
|         1.08 +- 0.00 times faster than '02-the-same-speed-as-c/bench-c-4-clang 1000 1'
|         1.66 +- 0.00 times faster than '01-six-times-faster-than-c/bench-x64-7 1000 1'
|
| Of course, one could improve things further using SWAR/SIMD...
| 414owen wrote:
| Very interesting approach. I should probably have specified
| that the somewhat naive assembly in
| `02-the-same-speed-as-c/loop-5.x64.s` is the fastest version I
| have.
|
| On my machine I'm getting 0.244s for `loop-5.x64.s` and 0.422s
| for your implementation above.
|
| I'm not sure why exactly we're seeing this discrepancy, and for
| what it's worth your implementation looks faster to me. I guess
| this is why you need to always benchmark on the hardware you're
| going to be running the code on...
| xoranth wrote:
| I reran the benchmark against loop-5 and loop-7 from the
| second post. Runtime is basically the same on my machine.
|
| I would have expected yours to be faster given that it needs
| to execute fewer instructions per loop iteration. Though
| maybe the CPU can run `adc` on more ports compared to a load
| from memory?
|
|     Summary
|       '01-six-times-faster-than-c/bench-x64-8 1000 1' ran
|         1.00 +- 0.00 times faster than '02-the-same-speed-as-c/bench-x64-7 1000 1'
|         1.66 +- 0.00 times faster than '01-six-times-faster-than-c/bench-x64-7 1000 1'
|     Summary
|       '01-six-times-faster-than-c/bench-x64-8 1000 1' ran
|         1.01 +- 0.00 times faster than '02-the-same-speed-as-c/bench-x64-5 1000 1'
|         1.66 +- 0.00 times faster than '01-six-times-faster-than-c/bench-x64-7 1000 1'
| torstenvl wrote:
| I'm not so sure that the right take-away is "hand-written
| assembler is 6x faster than C." It's more like "jumps are a lot
| slower than conditional arithmetic." And that can [edit:often] be
| achieved easily in C by simply not using switch statements when
| an if statement or two will work fine.
|
| Rewriting the C function as follows got a 5.5x speedup:
|
|     int run_switches(char *input) {
|         int r = 0;
|         char c;
|         while (1) {
|             c = *input++;
|             if (c == 's') r++;
|             if (c == 'p') r--;
|             if (c == '\0') break;
|         }
|         return r;
|     }
|
| Results:
|
|     [16:50:14 user@boxer ~/looptest] $ gcc -O3 bench.c loop1.c -o lone
|     [16:50:37 user@boxer ~/looptest] $ gcc -O3 bench.c loop2.c -o ltwo
|     [16:50:47 user@boxer ~/looptest] $ time ./lone 1000 1
|     449000
|     ./lone 1000 1  3.58s user 0.00s system 99% cpu 3.589 total
|     [16:50:57 user@boxer ~/looptest] $ time ./ltwo 1000 1
|     449000
|     ./ltwo 1000 1  0.65s user 0.00s system 99% cpu 0.658 total
| BoppreH wrote:
| What version of GCC are you using? For me both versions
| perform the same, both on Ubuntu and Windows:
|
|     $ time ./lone 1000 1
|     851000
|
|     real    0m3.578s
|     user    0m3.574s
|     sys     0m0.004s
|
|     $ time ./ltwo 1000 1
|     851000
|
|     real    0m3.583s
|     user    0m3.583s
|     sys     0m0.000s
|
|     $ gcc --version
|     gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
|     Copyright (C) 2019 Free Software Foundation, Inc.
|     This is free software; see the source for copying conditions.
|     There is NO warranty; not even for MERCHANTABILITY or FITNESS
|     FOR A PARTICULAR PURPOSE.
| torstenvl wrote:
| Sorry, I write 'gcc' purely out of force of habit. I'm using
| Clang/LLVM.
|
|     [17:23:00 user@boxer ~/looptest] $ uname -a
|     Darwin boxer.local 21.6.0 Darwin Kernel Version 21.6.0:
|     Thu Jun  8 23:57:12 PDT 2023;
|     root:xnu-8020.240.18.701.6~1/RELEASE_X86_64 x86_64
|     [17:23:47 user@boxer ~/looptest] $ cc -v
|     Apple clang version 14.0.0 (clang-1400.0.29.202)
|     Target: x86_64-apple-darwin21.6.0
|     Thread model: posix
|     InstalledDir: /Library/Developer/CommandLineTools/usr/bin
|
| Clang generates the sete instruction for me with the above
| code:
|
|     [17:23:49 user@boxer ~/looptest] $ gcc -c -O3 loop2.c
|     [17:25:00 user@boxer ~/looptest] $ objdump -d --symbolize-operands --x86-asm-syntax=intel --no-show-raw-insn loop2.o
|
|     loop2.o:    file format mach-o 64-bit x86-64
|
|     Disassembly of section __TEXT,__text:
|
|     0000000000000000 <_run_switches>:
|            0: push  rbp
|            1: mov   rbp, rsp
|            4: xor   eax, eax
|            6: nop   word ptr cs:[rax + rax]
|     <L0>:
|           10: movzx ecx, byte ptr [rdi]
|           13: add   rdi, 1
|           17: xor   edx, edx
|           19: cmp   cl, 115
|           1c: sete  dl
|           1f: add   eax, edx
|           21: xor   edx, edx
|           23: cmp   cl, 112
|           26: sete  dl
|           29: sub   eax, edx
|           2b: test  cl, cl
|           2d: jne   <L0>
|           2f: pop   rbp
|           30: ret
| 414owen wrote:
| Nice! There's a part two in which I rewrote the C. I got a 12x
| speedup :)
|
| https://owen.cafe/posts/the-same-speed-as-c/
|
| And as others have pointed out, you can tweak the input, then
| vectorize the algo, if you want to go that route.
|
| I considered this a pedagogical exercise, and I sincerely hope
| nobody will start dropping down to assembly without a very
| good reason.
| haberman wrote:
| > jumps are a lot slower than conditional arithmetic.
|
| This statement is true _if_ the jumps are unpredictable. If the
| jumps are predictable, then jumps will be faster.
|
| Linus had a whole rant about this back in the day, arguing that
| cmov is not useful if branches are predictable:
| https://yarchive.net/comp/linux/cmov.html
| torstenvl wrote:
| I haven't run any benchmarks, but jump-if-equal and set-if-
| equal would seem to have the same level of predictability.
|
| My naive, untested intuition is that there's only one
| meaningful difference: the former has to dump the entire
| pipeline on a miss, and the latter only has to nop a single
| instruction on a miss.
|
| But maybe I'm missing something. I'll re-read his rant.
|
| EDIT:
|
| Linus rants a lot, but makes one concrete claim:
|
|     You can always replace it by
|
|         j<negated condition> forward
|         mov ..., %reg
|     forward:
|
|     and assuming the branch is AT ALL predictable (and 95+% of
|     all branches are), *the branch-over will actually be a LOT
|     better for a CPU.*
|
| So, I decided to test that.
|
|     [18:50:14 user@boxer ~/src/looptest] $ diff -u loop2.s loop4.s
|     --- loop2.s     2023-07-06 18:40:11.000000000 -0400
|     +++ loop4.s     2023-07-06 18:46:58.000000000 -0400
|     @@ -17,11 +17,15 @@
|             incq    %rdi
|             xorl    %edx, %edx
|             cmpb    $115, %cl
|     -       sete    %dl
|     +       jne     _run_switches_jmptgt1
|     +       mov     $1, %dl
|     +_run_switches_jmptgt1:
|             addl    %edx, %eax
|             xorl    %edx, %edx
|             cmpb    $112, %cl
|     -       sete    %dl
|     +       jne     _run_switches_jmptgt2
|     +       mov     $1, %dl
|     +_run_switches_jmptgt2:
|             subl    %edx, %eax
|             testb   %cl, %cl
|             jne     LBB0_1
|     [18:50:29 user@boxer ~/src/looptest] $ gcc -O3 bench.c loop2.s -o l2
|     [18:50:57 user@boxer ~/src/looptest] $ gcc -O3 bench.c loop4.s -o l4
|     [18:51:02 user@boxer ~/src/looptest] $ time ./l2 1000 1
|     449000
|     ./l2 1000 1  0.69s user 0.00s system 99% cpu 0.697 total
|     [18:51:09 user@boxer ~/src/looptest] $ time ./l4 1000 1
|     449000
|     ./l4 1000 1  4.53s user 0.01s system 99% cpu 4.542 total
|
| I feel pretty confident that Linus has made a poor prediction
| about poor prediction here.
| haberman wrote:
| > jump-if-equal and set-if-equal would seem to have the
| same level of predictability.
|
| The difference is that branches have dedicated hardware
| (branch predictors) that will speculatively execute
| subsequent instructions based on their best guess about
| which way the branch will go. Whereas conditional moves
| cannot execute any subsequent instructions until the
| correct value is available.
|
| Put another way, CPUs have control flow speculation, but
| not conditional move speculation. I don't know if
| conditional move speculation would be a feasible thing to
| implement or not, but I'm pretty sure that no mainstream
| CPUs have such a feature.
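|
| A minimal illustration (a made-up example; exact codegen varies
| with compiler and flags): a data-dependent select like
|
|     int select_max(int a, int b) {
|         // At -O2, GCC and Clang typically lower this ternary to
|         // cmp + cmov on x86-64 rather than to a branch.
|         return a > b ? a : b;
|     }
|
| has no control flow to speculate past; every instruction that
| uses the result has to wait for the cmov's inputs to resolve.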
| DeathArrow wrote:
| Shouldn't the compiler be able to do that, too?
| ModernMech wrote:
| Yes, there's always the "sufficiently smart compiler" that
| can generate this code. Question is, does that compiler
| exist?
| Groxx wrote:
| I sure hope so. The semantics are trivially identical, the
| optimizations should be as well, by default - they should
| depend on semantics, not syntax. And GCC in another comment
| under this thread seems to be doing similar or identical
| optimizations in both cases.
|
| I wholly admit that this implies nothing about _all_
| optimizers. But it's a pretty reasonable one to expect.
| charcircuit wrote:
| >does that compiler exist?
|
| and if so are the compile times worth it
| thomasahle wrote:
| I tried https://godbolt.org/, and neither Clang nor GCC trunk
| give the same code for the two programs.
|
| Pretty shocking for such a simple program.
| p1necone wrote:
| Is rewriting switch statements to a bunch of ifs _always_
| faster? Or is there some number of cases where the switch is
| faster? Seems like it should be added as a compiler
| optimization if it's consistent.
| torstenvl wrote:
| There's an error in the pseudocode.
|
|     cmp ecx, 's'    # if (c == 's')
|     jne loop        # continue
|     add eax, 1      # res++
|     jmp loop        # continue
|
| should be
|
|     cmp ecx, 's'    # if (c != 's')
|     jne loop        # continue
|     add eax, 1      # res++
|     jmp loop        # continue
| agumonkey wrote:
| I believe the first `jne` should be `je`, right?
| torstenvl wrote:
| No, the assembler is correct. Jump (early) back to the
| beginning of the loop if not equal to s; otherwise, continue
| executing the next instruction (add eax, 1) and then
| unconditionally jump back to the beginning of the loop.
| BoppreH wrote:
| You can also use math to avoid most of the jumps:
|
|     #include <stdbool.h>
|
|     int run_switches(char *input) {
|         int res = 0;
|         while (true) {
|             char c = *input++;
|             if (c == '\0') return res;
|             // Here's the trick:
|             res += (c == 's') - (c == 'p');
|         }
|     }
|
| This gives a 3.7x speedup compared to loop-1.c. The lower line
| count is also nice.
| svachalek wrote:
| Nice. The way I read the cmove version, it's more or less
| this, except the trick line goes
|
|     res += (c == 's') ? 1 : (c == 'p') ? -1 : 0;
|
| I haven't done C in decades, so I don't trust myself to
| performance-test this, but I'm curious how it compares. Pretty
| disappointed that TFA didn't go back and try that in C.
| 414owen wrote:
| A clickbait title for an in-depth look at hand-optimizing a very
| simple loop.
| ftxbro wrote:
| I'm not a compiler expert, but if it's a "very simple loop",
| why is it still too complex for the compiler to turn into good
| machine code? Did they use a bad compiler on purpose? Or are
| computers just not yet fast enough for practical compilers to
| do a good job with very simple loops?
| twoodfin wrote:
| This is the right answer:
|
| https://news.ycombinator.com/item?id=36622584
|
| Optimal assembly (forgoing SIMD, at least) for this loop on
| modern x86 is highly dependent on the entropy of the runtime
| data.
| ftxbro wrote:
| OK, so they were abusing the benchmark, in the sense that the
| compiler's output would be faster on less contrived test data?
| Do I have to look up what FDO, PGO, and cmov are to understand
| the answer?
| tylerhou wrote:
| The compiler would generate different code if it knew the
| rates at which branches are taken.
|
| If a branch is almost always taken or almost never taken,
| a compiler will want to emit a jump. The frontend will be
| able to predict the jump with high probability, and a
| successfully-predicted jump is "free." The cost of a
| misprediction is paid for by the near-zero cost of the
| many successful predictions.
|
| If a branch is hard to predict (and taking versus not
| taking it would load a different value into a
| register/memory), the compiler wants to emit a
| conditional move (cmov). A conditional move is slightly
| "more expensive" in the backend because the CPU has to
| wait for the condition to resolve before it can execute
| instructions dependent on the output. However, it is much
| cheaper than a branch that mispredicts around half of the
| time.
|
| FDO (feedback-directed optimization) or PGO (profile-
| guided optimization) means "run the code on some sample
| input and profile how often branches are taken/not
| taken." It gives the compiler more information to
| generate better code.
|
| The problem with the blog post is that the compiler has
| no idea what the function's input data will look like. It
| (arbitrarily) chose to generate branches instead of
| cmovs. However, if the benchmark input is better suited
| for cmovs, then the benchmark will (wrongly) show that
| the compiler generates "slow" assembly. But that's not a
| fair test, because with PGO/FDO the compiler would
| generate equivalent assembly to the "fast" assembly
| (actually, probably faster). Finally, the human (OP) is
| using their knowledge of the benchmark data "unfairly" to
| write better assembly than the compiler.
|
| The takeaway is: most of the time, one can't optimize
| code/assembly in a vacuum. You also need to know what the
| input data and access patterns look like. FDO/PGO gives
| the compiler more data to understand what the input
| data/access patterns look like.
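|
| Without profile data, a programmer can also hand the compiler
| a likelihood hint directly. A minimal sketch using the
| GCC/Clang builtin __builtin_expect on the loop from the post
| (the macro name and function name are made up):
|
|     #define unlikely(x) __builtin_expect(!!(x), 0)
|
|     int run_switches_hinted(const char *s) {
|         int res = 0;
|         for (;;) {
|             char c = *s++;
|             if (unlikely(c == '\0'))  // hit at most once per call
|                 return res;
|             res += (c == 's') - (c == 'p');
|         }
|     }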
| ftxbro wrote:
| Thank you, this is an amazingly comprehensive answer! Now I
| wonder what the workflow for using these compiler features
| would be. Say I'm a normal or bad C programmer: I write my
| program, use valgrind to check that it doesn't have obvious
| problems, and compile it with -march=native or whatever. Can I
| then add a step to the workflow that somehow recompiles the
| program in a way that uses the branching or access patterns of
| some example inputs I let it process for that purpose?
| tylerhou wrote:
| Yes, Clang supports FDO, but it might be hard to set up
| (I've never set up FDO myself). You could check out
| https://github.com/google/autofdo and
| https://clang.llvm.org/docs/UsersManual.html#profile-guided-....
|
| (People within Google say "FDO", basically everyone else
| says "PGO".)
| cjensen wrote:
| The problem is the author of the article is making some huge
| implicit assumptions that the compiler can't possibly know
| about.
|
| Consider this statement: "However, we know some things about
| this loop. We know that the only time we break out of it is
| when we hit the null terminator ('\0'). The code clang
| generates checks for the null terminator first, but this
| makes no sense."
|
| This statement contains huge assumptions about the lengths of
| the input strings and the frequency of the letters 's' and
| 'p' in the input. And then it has the chutzpah to call the
| compiler's failure to read his mind about this "making no
| sense."
|
| A good first effort by the author, but he has not sufficiently
| thought through the problem.
| moonchild wrote:
| > are computers just not yet fast enough to do a good job
| with very simple loops in practical compilers?
|
| The short answer to this question is 'yes', but there are
| some extenuating factors:
|
| - Although we could do interesting things with unlimited
| computational resources, the current crop of c compilers is
| simply not very good, compared with what's possible today.
|
| - Performance is always workload-dependent; the compiler has
| been somewhat shafted here because it doesn't know what sorts
| of inputs the function usually receives. The compiler output
| is better than the 'improved' code for some inputs. (It's
| possible you could get a better result from the existing
| compilers and c code just by using profile-guided
| optimisation.)
|
| - The difference is prone to be more pronounced in simple
| loops than large ones. This is a contrived use-case. There is
| not a factor of 6 of performance hiding in optimised c code
| which could be recovered by doing the sorts of optimisations
| done by the op. Probably something more like 10-20%.
| Const-me wrote:
| I'm probably an optimization expert, and I would solve that
| problem completely differently.
|
| On my computer, the initial C version runs at 389 MB/second. I
| haven't tested the assembly versions, but if they deliver the
| same 6.2x speedup, that would be 2.4 GB/second here.
|
| Here's a C++ version which for long buffers exceeds 24
| GB/second on my computer:
| https://gist.github.com/Const-me/3ade77faad47f0fbb0538965ae7...
| That's a 61x speedup compared to the original version, without
| any assembly, based on AVX2 intrinsics.
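|
| The gist has the real implementation; the core idea can be
| sketched with plain AVX2 intrinsics like this (a simplified
| stand-in, not Const-me's actual code; the function name is
| made up):
|
|     #include <immintrin.h>
|     #include <stddef.h>
|
|     // Compare 32 bytes at a time against 's' and 'p', then
|     // popcount the movemask bits. Needs a known length; the
|     // tail is handled by the scalar loop.
|     int run_switches_avx2(const char *buf, size_t len) {
|         const __m256i vs = _mm256_set1_epi8('s');
|         const __m256i vp = _mm256_set1_epi8('p');
|         int res = 0;
|         size_t i = 0;
|         for (; i + 32 <= len; i += 32) {
|             __m256i v = _mm256_loadu_si256((const __m256i *)(buf + i));
|             res += __builtin_popcount(
|                 _mm256_movemask_epi8(_mm256_cmpeq_epi8(v, vs)));
|             res -= __builtin_popcount(
|                 _mm256_movemask_epi8(_mm256_cmpeq_epi8(v, vp)));
|         }
|         for (; i < len; ++i)  // scalar tail
|             res += (buf[i] == 's') - (buf[i] == 'p');
|         return res;
|     }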
| gavinray wrote:
| Do you know if this is possible using "std::experimental::simd"
| out of curiosity?
|
| https://en.cppreference.com/w/cpp/experimental/simd
| lukas099 wrote:
| Would it be possible to write a code profiler and compiler that
| work together to optimize code based on real-world data? The
| profiler would output data that would feed back into the
| compiler, telling it which branches were selected most often,
| which would recompile optimizing for the profile. Would this even
| work? Has it already been done?
| failuser wrote:
| Full-blown predication support would be nice to have, but it
| interferes with compact instruction encoding.
|
| A bloated ISA like x86 might actually be able to accommodate
| predicate support, but who would try such a radical change?
| vardump wrote:
| I think it's straightforward to optimize this to the point
| where it's maybe about 10x faster than the "optimized"
| version. The answer is of course SIMD vectorization.
| nwallin wrote:
| IMHO the original code wasn't written in a way that's
| particularly friendly to compilers. If you write it like this:
|
|     int run_switches_branchless(const char* s) {
|         int result = 0;
|         for (; *s; ++s) {
|             result += *s == 's';
|             result -= *s == 'p';
|         }
|         return result;
|     }
|
| ...the compiler will do all the branchless sete/cmov stuff as it
| sees fit. It will be the same speed as the optimized assembly in
| the post, +/- something insignificant. However it won't unroll
| and vectorize the loop. If you write it like this:
|
|     int run_switches_vectorized(const char* s, size_t size) {
|         int result = 0;
|         for (; size--; ++s) {
|             result += *s == 's';
|             result -= *s == 'p';
|         }
|         return result;
|     }
|
| It will know the size of the loop, and will unroll it and use
| AVX-512 instructions if they're available. This will be
| substantially faster than the first loop for large inputs,
| although I'm too lazy to benchmark just how much faster it is.
|
| Now, this requires knowing the size of your string in advance,
| and maybe you're the sort of C programmer who doesn't keep track
| of how big your strings are. I'm not your coworker, I don't
| review your code. Do what you want. But you really _really_
| probably shouldn't.
|
| https://godbolt.org/z/rde51zMd8
| jonny_eh wrote:
| > But you really really probably shouldn't.
|
| Shouldn't "not" keep track of string length?
| nwallin wrote:
| Err... yes. You shouldn't not keep track of string/buffer
| sizes.
| 414owen wrote:
| The version that's friendly to the compiler is described in
| part two: https://owen.cafe/posts/the-same-speed-as-c/
|
| It achieves 3.88 GiB/s.
|
| I intentionally didn't go down the route of vectorizing. I
| wanted to keep the scope of the problem small, and show off the
| assembly tips and tricks in the post, but maybe there's
| potential for a future post, where I pad the input string and
| vectorize the algorithm :)
| aidenn0 wrote:
| A while back, I wrote a UTF-8 decoder in Common Lisp, targeting
| SBCL (it already has one built in, this was an exercise). Pretty
| much all of the optimization win (after the obvious low-hanging
| fruit) was structuring the code so that the compiler would
| generate cmov* instructions rather than branches.
| moonchild wrote:
| Branches are prone to be faster than conditional moves if they
| are correctly predicted, because they do not increase the
| critical path length. And utf-8 decoders are commonly run on
| all-ascii input. What were you benchmarking on?
| aidenn0 wrote:
| I ran separate benchmarks on all-ASCII, BMP-only, and ascii
| with non-BMP. ASCII was _not_ slower on the low-branch
| version.
| whartung wrote:
| What's some examples of the code changes that you made? And did
| you just do repeated disassemblies of the functions to see that
| it was using the correct instructions, or did you do some
| benchmarking to show your changes were actual improvements?
| aidenn0 wrote:
| Gosh, I'd have to see if I can dig it up this was a few years
| ago.
|
| I did all of the above, plus profiling (sb-sprof combined
| with disassemble will show assembly level profiling).
| Sesse__ wrote:
| This code screams for SIMD! If you can change the prototype to
| take an explicit length, you could easily read and process 16
| bytes at a time (the compares will give you values you can just
| add and subtract directly). Heck, even calling strlen() at the
| function's start to get the explicit length would probably be
| worth it.
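|
| A rough sketch of that idea with SSE2 intrinsics (a
| hypothetical helper; takes an explicit length, keeps a scalar
| tail, and flushes the byte accumulators before they can wrap):
|
|     #include <emmintrin.h>
|     #include <stddef.h>
|     #include <stdint.h>
|
|     // pcmpeqb yields 0xFF (-1) per matching byte, so subtracting
|     // the compare mask from a byte accumulator counts matches.
|     int run_switches_sse2(const char *buf, size_t len) {
|         const __m128i vs = _mm_set1_epi8('s');
|         const __m128i vp = _mm_set1_epi8('p');
|         const __m128i zero = _mm_setzero_si128();
|         uint64_t s_total = 0, p_total = 0;
|         size_t i = 0;
|         while (i + 16 <= len) {
|             __m128i s_acc = zero, p_acc = zero;
|             // A u8 lane can hold up to 255 increments; stop at 128.
|             for (int b = 0; b < 128 && i + 16 <= len; ++b, i += 16) {
|                 __m128i v = _mm_loadu_si128((const __m128i *)(buf + i));
|                 s_acc = _mm_sub_epi8(s_acc, _mm_cmpeq_epi8(v, vs));
|                 p_acc = _mm_sub_epi8(p_acc, _mm_cmpeq_epi8(v, vp));
|             }
|             // psadbw against zero sums the byte lanes into two u64s.
|             __m128i ss = _mm_sad_epu8(s_acc, zero);
|             __m128i ps = _mm_sad_epu8(p_acc, zero);
|             s_total += (uint64_t)_mm_cvtsi128_si64(ss) +
|                        (uint64_t)_mm_cvtsi128_si64(_mm_srli_si128(ss, 8));
|             p_total += (uint64_t)_mm_cvtsi128_si64(ps) +
|                        (uint64_t)_mm_cvtsi128_si64(_mm_srli_si128(ps, 8));
|         }
|         int res = (int)((int64_t)s_total - (int64_t)p_total);
|         for (; i < len; ++i)  // scalar tail
|             res += (buf[i] == 's') - (buf[i] == 'p');
|         return res;
|     }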
| camel-cdr wrote:
| I threw together a quick risc-v vectorized implementation:
|
|     size_t run(char *str) {
|         uint8_t *p = (uint8_t*)str;
|         long end = 0;
|         size_t res = 0, vl;
|         while (1) {
|             vl = __riscv_vsetvlmax_e8m8();
|             vuint8m8_t v = __riscv_vle8ff_v_u8m8(p, &vl, vl);
|             end = __riscv_vfirst_m_b1(__riscv_vmseq_vx_u8m8_b1(v, '\0', vl), vl);
|             if (end >= 0)
|                 break;
|             res += __riscv_vcpop_m_b1(__riscv_vmseq_vx_u8m8_b1(v, 's', vl), vl);
|             res -= __riscv_vcpop_m_b1(__riscv_vmseq_vx_u8m8_b1(v, 'p', vl), vl);
|             p += vl;
|         }
|         vl = __riscv_vsetvl_e8m8(end);
|         vuint8m8_t v = __riscv_vle8_v_u8m8(p, vl);
|         res += __riscv_vcpop_m_b1(__riscv_vmseq_vx_u8m8_b1(v, 's', vl), vl);
|         res -= __riscv_vcpop_m_b1(__riscv_vmseq_vx_u8m8_b1(v, 'p', vl), vl);
|         return res;
|     }
|
| Here are the results from the above, the switch, and the table
| C implementations, run on my mangopi mq pro (C906, in-order
| rv64gc with rvv 0.7.1 and a 128-bit vector length):
|
|     switch: 0.19 Bytes/Cycle
|     tbl:    0.17 Bytes/Cycle
|     rvv:    1.57 Bytes/Cycle (dips down to 1.35 after ~30 KiB)
|
| Edit: you can go up to 2/1.7 Bytes/Cycle if you make sure the
| pointer is page-aligned (and vl isn't larger than the page
| size); see comments.
| dzaima wrote:
| To be fully correct, you'd need the load to be a fault-only-
| first load (which rvv does have), otherwise that could fail if
| the null byte was just before the end of allocated memory.
| camel-cdr wrote:
| I'm not sure I fully understand fault-only-first load, but
| reading the description of vle8ff.v I think I only need to
| exchange the load inside of the loop?
|
| How does the normal load deal with faults?
|
| I'll update the parent comment; the change slowed things down
| from 2/1.7 to 1.57/1.36 Bytes/Cycle.
| dzaima wrote:
| You'd probably want to have a new __riscv_vsetvlmax_e8m8 at
| the start of each loop iteration, as otherwise an earlier
| iteration could cut off the vl (e.g. page unloaded by the
| OS), and thus the loop continues with the truncated vl.
|
| The normal load should just segfault if any loaded byte is
| outside of readable memory, same as with a scalar load
| which is similarly partly outside.
| camel-cdr wrote:
| > You'd probably want to have a new
| __riscv_vsetvlmax_e8m8 at the start of each loop
| iteration, as otherwise an earlier iteration could cut
| off the vl (e.g. page unloaded by the OS), and thus the
| loop continues with the truncated vl.
|
| Oh yeah, that was a big oversight. Unfortunately, fixing it
| didn't undo the performance regression.
|
| > The normal load should just segfault if any loaded byte
| is outside of readable memory, same as with a scalar load
| which is similarly partly outside.
|
| I don't quite understand how that plays out.
|
| The reference memcpy implementation uses `vle8.v` and the
| reference strlen implementation uses `vle8ff.v`.
|
| I think I understand how it works in strlen, but why does
| memcpy work without the ff? Does it just skip the
| instruction, or repeat it? Because in either case,
| shouldn't `vle8.v` work with strlen as well? There must
| be another option, but I can't think of any.
|
| Also, does this mean I can get the original performance
| back, if I make sure to page align my pointers and use
| `vle8.v`?
| dzaima wrote:
| The memcpy doesn't use a vlmax, it uses a hand-chosen vl.
| The load won't fault on any elements not loaded (here,
| elements past the vl), so the memcpy is fine as it only
| loads items it'll definitely need, whereas your original
| code can read elements past the null byte.
|
| And yeah, aligning the pointer manually would work
| (though then it wouldn't be portable code, as the spec
| does allow for rvv implementations with VLEN of up to
| 65536 (8KB per register; 64KB with LMUL=8), which'll be
| larger than the regular 4KB pages).
| camel-cdr wrote:
| Ah, this makes a lot more sense now. I thought the
| "fault" was about the kernel interrupting when a new page
| needs to be loaded into physical memory, which would also
| happen for memcpy.
| camel-cdr wrote:
| I just found your rvv intrinsics-viewer [0], that'll be so
| helpful.
|
| I tried building one myself, but my miserable web skills
| didn't allow me to lazily load the instructions, which made
| it too slow for actual use.
|
| Can I share your project on lemmy?
|
| [0] https://dzaima.github.io/intrinsics-viewer
| dzaima wrote:
| Go ahead! I'm not much of a web dev either, but decided to
| struggle through it, mainly just to have better searching.
| (Originally for Intel & ARM intrinsics, which are also
| available if downloaded offline.)
| eklitzke wrote:
| Rearranging branches (and perhaps blocks too?) will definitely be
| done if you are building using FDO, because without FDO (or PGO)
| the compiler has no idea how likely each branch is to be taken.
| Cmov can also be enabled by FDO in some cases.
|
| However, whether or not using cmov is effective compared to a
| regular test/jump is highly dependent on how predictable the
| branch is, with cmov typically performing better when the branch
| is very unpredictable. Since they got a 6x speedup with cmov, I
| assume that their test input (which isn't described in the post,
| and is also not in their GitHub repo) consists of random strings
| made up almost entirely of 's' and 'p' characters. There's nothing
| wrong with this, but it does make the post seem a little
| misleading to me, as their clever speedup is mostly about
| exploiting an unmentioned property of the data that is highly
| specific to their benchmark.
| 414owen wrote:
| > because without FDO (or PGO) the compiler has no idea how
| likely each branch is to be taken
|
| So, the maximum number of times you can hit '\0' is once per
| string, because then the function returns, but you can hit the
| other characters many times, which seems to be information a
| compiler has access to without PGO.
|
| PGO does help, of course, and on my machine gives me 2.80s,
| which is better than the code at the end of the `Rearranging
| blocks` section :)
|
| > I assume that their test input (which isn't described in the
| post, and is also not in their GitHub repo)
|
| It's described under `Benchmarking setup`, and is in the
| repository here:
| https://github.com/414owen/blog-code/blob/master/01-six-time...
|
| Side note: There's a part two to this post (linked at the
| bottom) where I make the C code as fast as I possibly can, and
| it beats all the assembly in this post.
|
| I never said writing assembly is (necessarily) a good idea; I
| just find optimizing it, and deciphering compiler output, an
| interesting challenge and a good learning opportunity.
| nwallin wrote:
| > I assume that their test input (which isn't described in the
| post, and is also not in their GitHub repo) consists of random
| strings consisting almost entirely of s and p characters.
|
| test code is here:
| https://github.com/414owen/blog-code/blob/master/02-the-same...
| it randomly selects between 's' or 'p'. The characters can't be
| anything other than 's', 'p', or the terminating null. Knowing
| that particular fact about our input gives us this ...clever...
| optimization:
|
|     int run_switches(const char* s) {
|         int result = 0;
|         while (*s)
|             // 's'|1 == 's' (115) and 'p'|1 == 'q' (113), so
|             // subtracting 'r' (114) yields +1 or -1.
|             result += (1 | *s++) - 'r';
|         return result;
|     }
|
| which compiles to:
|
|     run_switches:
|             movzx   eax, BYTE PTR [rdi]
|             xor     edx, edx
|             test    al, al
|             je      .L1
|     .L3:
|             or      eax, 1
|             inc     rdi
|             movsx   eax, al
|             lea     edx, [rdx-114+rax]
|             movzx   eax, BYTE PTR [rdi]
|             test    al, al
|             jne     .L3
|     .L1:
|             mov     eax, edx
|             ret
|
| This is too clever by half, of course, but it perfectly
| illustrates your point about exploiting properties of the data.
___________________________________________________________________
(page generated 2023-07-06 23:00 UTC)