[HN Gopher] Six times faster than C
       ___________________________________________________________________
        
       Six times faster than C
        
       Author : 414owen
       Score  : 115 points
       Date   : 2023-07-06 16:20 UTC (6 hours ago)
        
 (HTM) web link (owen.cafe)
 (TXT) w3m dump (owen.cafe)
        
       | anthony88 wrote:
       | [flagged]
        
         | [deleted]
        
       | gavinray wrote:
       | Fantastic post, I appreciated that the ASM was displayed in tabs
       | as both "standard" and "visual-arrows"-annotated.
       | 
       | Kept me reading into the follow-up article.
       | 
       | Also, I love the UI of this blog.
        
       | RobotToaster wrote:
       | Was the C compiled with optimisation enabled?
        
         | 414owen wrote:
          | Yes, I explained in the `Benchmarking setup` section that I
          | used `-march=native`, but I guess I forgot to mention I used
          | `-O3`.
        
       | sltkr wrote:
        | How much faster is this:
        |
        |     int run_switches(const char *buf) {
        |         size_t len = strlen(buf);
        |         int res = 0;
        |         for (int i = 0; i < len; ++i) {
        |             res += (buf[i] == 's') - (buf[i] == 'p');
        |         }
        |         return res;
        |     }
       | 
       | strlen() should be implemented in a pretty fast way, and after
       | the buffer size is known, the compiler can autovectorize the
       | inner loop, which does happen in practice:
       | https://gcc.godbolt.org/z/qYfadPYoq
        
       | xoranth wrote:
       | I think I managed to improve on both this post, and its sequel,
       | at the cost of specializing the function for the case of a string
       | made only of 's' and 'p'.
       | 
       | The benchmark only tests strings made of 's' and 'p', so I think
       | it is fair.
       | 
        | The idea is as follows. We want to increase `res` by one when the
        | next character is `s`. Naively, we might try something like this:
        |
        |     res += (c - 'r');  // is `res += 1` when c == 's'
        |
        | This doesn't work, as `'p' - 'r' == -2`, and we'd need it to be
        | -1.
       | 
        | But `'p' - 'r'`, when viewed as an unsigned integer, underflows,
        | setting the carry flag. It turns out x64 has an instruction (adc)
        | that adds two registers _plus_ the carry flag.
       | 
        | Therefore we can replace two `cmp, cmov` with one `sub, adc`:
        |
        |     run_switches:
        |         xor    eax, eax            # res = 0
        |     loop:
        |         movsx  ecx, byte ptr [rdi]
        |         test   ecx, ecx
        |         je     ret
        |         inc    rdi
        |         sub    ecx, 'r'
        |         adc    eax, ecx            # Magic happens here
        |         jmp    loop
        |     ret:
        |         ret
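        |
        | (In C, the same carry trick can be sketched like this; my gloss
        | of the asm above, and it still relies on the input containing
        | only 's' and 'p':)
        |
        |     /* Sketch: emulate `sub ecx, 'r'` followed by `adc eax, ecx`.
        |      * 's': d == 1,  borrow == 0 -> res += 1
        |      * 'p': d == -2, borrow == 1 -> res += -1 */
        |     int run_switches_carry(const char *input) {
        |         int res = 0;
        |         unsigned char c;
        |         while ((c = (unsigned char)*input++) != '\0') {
        |             int d = (int)c - 'r';    /* the sub */
        |             int borrow = c < 'r';    /* its carry flag */
        |             res += d + borrow;       /* the adc */
        |         }
        |         return res;
        |     }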
       | 
        | Benchmarks are as follows (`bench-x64-8` is the asm above):
        |
        |     Summary
        |       '01-six-times-faster-than-c/bench-x64-8 1000 1' ran
        |         1.08 +- 0.00 times faster than '02-the-same-speed-as-c/bench-c-4-clang 1000 1'
        |         1.66 +- 0.00 times faster than '01-six-times-faster-than-c/bench-x64-7 1000 1'
       | 
       | Of course, one could improve things further using SWAR/SIMD...
        
         | 414owen wrote:
          | Very interesting approach. I should probably have specified
          | that the somewhat naive assembly in
          | `02-the-same-speed-as-c/loop-5.x64.s` is the fastest version
          | I have.
         | 
         | On my machine I'm getting 0.244s for `loop-5.x64.s` and 0.422s
         | for your implementation above.
         | 
         | I'm not sure why exactly we're seeing this discrepancy, and for
         | what it's worth your implementation looks faster to me. I guess
         | this is why you need to always benchmark on the hardware you're
         | going to be running the code on...
        
           | xoranth wrote:
            | I reran the benchmark vs loop-5 and loop-7 from the second
            | post. Runtime is basically the same on my machine.
           | 
            | I would have expected yours to be faster given that it needs
            | to execute fewer instructions per loop iteration. Though
            | maybe the CPU can run `adc` on more ports compared to a load
            | from memory?
            |
            |     Summary
            |       '01-six-times-faster-than-c/bench-x64-8 1000 1' ran
            |         1.00 +- 0.00 times faster than '02-the-same-speed-as-c/bench-x64-7 1000 1'
            |         1.66 +- 0.00 times faster than '01-six-times-faster-than-c/bench-x64-7 1000 1'
            |
            |     Summary
            |       '01-six-times-faster-than-c/bench-x64-8 1000 1' ran
            |         1.01 +- 0.00 times faster than '02-the-same-speed-as-c/bench-x64-5 1000 1'
            |         1.66 +- 0.00 times faster than '01-six-times-faster-than-c/bench-x64-7 1000 1'
        
       | torstenvl wrote:
        | I'm not so sure that the right take-away is "hand-written
        | assembler is 6x faster than C." It's more like "jumps are a lot
        | slower than conditional arithmetic." And that can [edit: often]
        | be achieved easily in C by simply not using switch statements
        | when an if statement or two will work fine.
       | 
        | Rewriting the C function as follows got a 5.5x speedup:
        |
        |     int run_switches(char *input) {
        |         int r = 0;
        |         char c;
        |
        |         while (1) {
        |             c = *input++;
        |             if (c == 's') r++;
        |             if (c == 'p') r--;
        |             if (c == '\0') break;
        |         }
        |         return r;
        |     }
       | 
        | Results:
        |
        |     [16:50:14 user@boxer ~/looptest] $ gcc -O3 bench.c loop1.c -o lone
        |     [16:50:37 user@boxer ~/looptest] $ gcc -O3 bench.c loop2.c -o ltwo
        |     [16:50:47 user@boxer ~/looptest] $ time ./lone 1000 1
        |     449000
        |     ./lone 1000 1  3.58s user 0.00s system 99% cpu 3.589 total
        |     [16:50:57 user@boxer ~/looptest] $ time ./ltwo 1000 1
        |     449000
        |     ./ltwo 1000 1  0.65s user 0.00s system 99% cpu 0.658 total
        
         | BoppreH wrote:
          | What version of GCC are you using? For me both versions perform
          | the same, both on Ubuntu and Windows:
          |
          |     $ time ./lone 1000 1
          |     851000
          |
          |     real    0m3.578s
          |     user    0m3.574s
          |     sys     0m0.004s
          |
          |     $ time ./ltwo 1000 1
          |     851000
          |
          |     real    0m3.583s
          |     user    0m3.583s
          |     sys     0m0.000s
          |
          |     $ gcc --version
          |     gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
          |     Copyright (C) 2019 Free Software Foundation, Inc.
          |     This is free software; see the source for copying conditions.  There is NO
          |     warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
        
           | torstenvl wrote:
            | Sorry, I write 'gcc' purely out of force of habit. I'm using
            | Clang/LLVM.
            |
            |     [17:23:00 user@boxer ~/looptest] $ uname -a
            |     Darwin boxer.local 21.6.0 Darwin Kernel Version 21.6.0: Thu Jun  8 23:57:12 PDT 2023; root:xnu-8020.240.18.701.6~1/RELEASE_X86_64 x86_64
            |     [17:23:47 user@boxer ~/looptest] $ cc -v
            |     Apple clang version 14.0.0 (clang-1400.0.29.202)
            |     Target: x86_64-apple-darwin21.6.0
            |     Thread model: posix
            |     InstalledDir: /Library/Developer/CommandLineTools/usr/bin
           | 
            | Clang generates the sete instruction for me with the above
            | code:
            |
            |     [17:23:49 user@boxer ~/looptest] $ gcc -c -O3 loop2.c
            |     [17:25:00 user@boxer ~/looptest] $ objdump -d --symbolize-operands --x86-asm-syntax=intel --no-show-raw-insn loop2.o
            |
            |     loop2.o: file format mach-o 64-bit x86-64
            |
            |     Disassembly of section __TEXT,__text:
            |
            |     0000000000000000 <_run_switches>:
            |            0:       push rbp
            |            1:       mov rbp, rsp
            |            4:       xor eax, eax
            |            6:       nop word ptr cs:[rax + rax]
            |     <L0>:
            |           10:       movzx ecx, byte ptr [rdi]
            |           13:       add rdi, 1
            |           17:       xor edx, edx
            |           19:       cmp cl, 115
            |           1c:       sete dl
            |           1f:       add eax, edx
            |           21:       xor edx, edx
            |           23:       cmp cl, 112
            |           26:       sete dl
            |           29:       sub eax, edx
            |           2b:       test cl, cl
            |           2d:       jne  <L0>
            |           2f:       pop rbp
            |           30:       ret
        
         | 414owen wrote:
         | Nice! There's a part two in which I rewrote the C. I got a 12x
         | speedup :)
         | 
         | https://owen.cafe/posts/the-same-speed-as-c/
         | 
         | And as others have pointed out, you can tweak the input, then
         | vectorize the algo, if you want to go that route.
         | 
         | I considered this a pedagogical exercise and I sincerely hope
         | nobody will start dropping down to assembly without a very good
         | reason to.
        
         | haberman wrote:
         | > jumps are a lot slower than conditional arithmetic.
         | 
         | This statement is true _if_ the jumps are unpredictable. If the
         | jumps are predictable, then jumps will be faster.
         | 
         | Linus had a whole rant about this back in the day, arguing that
         | cmov is not useful if branches are predictable:
         | https://yarchive.net/comp/linux/cmov.html
        
           | torstenvl wrote:
           | I haven't run any benchmarks, but jump-if-equal and set-if-
           | equal would seem to have the same level of predictability.
           | 
           | My naive, untested intuition is that there's only one
           | meaningful difference: the former has to dump the entire
           | pipeline on a miss, and the latter only has to nop a single
           | instruction on a miss.
           | 
           | But maybe I'm missing something. I'll re-read his rant.
           | 
           | EDIT:
           | 
            | Linus rants a lot, but makes one concrete claim:
            |
            |     You can always replace it by
            |
            |             j<negated condition> forward
            |             mov ..., %reg
            |     forward:
            |
            |     and assuming the branch is AT ALL predictable (and 95+% of all branches
            |     are), *the branch-over will actually be a LOT better for a CPU.*
           | 
            | So, I decided to test that.
            |
            |     [18:50:14 user@boxer ~/src/looptest] $ diff -u loop2.s loop4.s
            |     --- loop2.s 2023-07-06 18:40:11.000000000 -0400
            |     +++ loop4.s 2023-07-06 18:46:58.000000000 -0400
            |     @@ -17,11 +17,15 @@
            |       incq %rdi
            |       xorl %edx, %edx
            |       cmpb $115, %cl
            |     - sete %dl
            |     + jne _run_switches_jmptgt1
            |     + mov $1,   %dl
            |     +_run_switches_jmptgt1:
            |       addl %edx, %eax
            |       xorl %edx, %edx
            |       cmpb $112, %cl
            |     - sete %dl
            |     + jne _run_switches_jmptgt2
            |     + mov $1,   %dl
            |     +_run_switches_jmptgt2:
            |       subl %edx, %eax
            |       testb %cl, %cl
            |       jne LBB0_1
            |     [18:50:29 user@boxer ~/src/looptest] $ gcc -O3 bench.c loop2.s -o l2
            |     [18:50:57 user@boxer ~/src/looptest] $ gcc -O3 bench.c loop4.s -o l4
            |     [18:51:02 user@boxer ~/src/looptest] $ time ./l2 1000 1
            |     449000
            |     ./l2 1000 1  0.69s user 0.00s system 99% cpu 0.697 total
            |     [18:51:09 user@boxer ~/src/looptest] $ time ./l4 1000 1
            |     449000
            |     ./l4 1000 1  4.53s user 0.01s system 99% cpu 4.542 total
           | 
           | I feel pretty confident that Linus has made a poor prediction
           | about poor prediction here.
        
             | haberman wrote:
             | > jump-if-equal and set-if-equal would seem to have the
             | same level of predictability.
             | 
             | The difference is that branches have dedicated hardware
             | (branch predictors) that will speculatively execute
             | subsequent instructions based on their best guess about
             | which way the branch will go. Whereas conditional moves
             | cannot execute any subsequent instructions until the
             | correct value is available.
             | 
             | Put another way, CPUs have control flow speculation, but
             | not conditional move speculation. I don't know if
             | conditional move speculation would be a feasible thing to
             | implement or not, but I'm pretty sure that no mainstream
             | CPUs have such a feature.
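              |
              | (One way to see this effect is to time the same branchy
              | loop on predictable versus random data; a rough sketch,
              | with the caveat that at -O3 the compiler may if-convert
              | the branch away, so compile at -O1 to keep the jump:)
              |
              |     #include <stdio.h>
              |     #include <stdlib.h>
              |     #include <time.h>
              |
              |     /* The branch under test: predictable iff the data is. */
              |     int count_small(const unsigned char *a, size_t n) {
              |         int res = 0;
              |         for (size_t i = 0; i < n; ++i)
              |             if (a[i] < 128) res++;
              |         return res;
              |     }
              |
              |     int main(void) {
              |         size_t n = 1 << 24;
              |         unsigned char *a = malloc(n);
              |         for (size_t i = 0; i < n; ++i) a[i] = rand() & 0xff;
              |         clock_t t0 = clock();
              |         int r = count_small(a, n);       /* unpredictable */
              |         clock_t t1 = clock();
              |         for (size_t i = 0; i < n; ++i) a[i] = 0;
              |         clock_t t2 = clock();
              |         r += count_small(a, n);          /* predictable */
              |         clock_t t3 = clock();
              |         printf("%d; random: %ld ticks, predictable: %ld ticks\n",
              |                r, (long)(t1 - t0), (long)(t3 - t2));
              |         free(a);
              |         return 0;
              |     }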
        
         | DeathArrow wrote:
         | Shouldn't the compiler be able to do that, too?
        
           | ModernMech wrote:
           | Yes, there's always the "sufficiently smart compiler" that
           | can generate this code. Question is, does that compiler
           | exist?
        
             | Groxx wrote:
             | I sure hope so. The semantics are trivially identical, the
             | optimizations should be as well, by default - they should
             | depend on semantics, not syntax. And GCC in another comment
             | under this thread seems to be doing similar or identical
             | optimizations in both cases.
             | 
              | I wholly admit that this implies nothing about _all_
              | optimizers. But it's a pretty reasonable one to expect.
        
             | charcircuit wrote:
             | >does that compiler exist?
             | 
              | and if so, are the compile times worth it?
        
           | thomasahle wrote:
            | I tried https://godbolt.org/, and neither Clang nor GCC trunk
            | gives the same code for the two programs.
           | 
           | Pretty shocking for such a simple program.
        
         | p1necone wrote:
         | Is rewriting switch statements to a bunch of ifs _always_
         | faster? Or is there some number of cases where the switch is
         | faster? Seems like it should be added as a compiler
          | optimization if it's consistent.
        
       | torstenvl wrote:
        | There's an error in the pseudocode.
        |
        |     cmp     ecx, 's'            #   if (c == 's')
        |     jne     loop                #     continue
        |     add     eax, 1              #   res++
        |     jmp     loop                #   continue
        |
        | should be
        |
        |     cmp     ecx, 's'            #   if (c != 's')
        |     jne     loop                #     continue
        |     add     eax, 1              #   res++
        |     jmp     loop                #   continue
        
         | agumonkey wrote:
          | I believe the first `jne` should be `je`, right?
        
           | torstenvl wrote:
           | No, the assembler is correct. Jump (early) back to the
           | beginning of the loop if not equal to s; otherwise, continue
           | executing the next instruction (add eax, 1) and then
           | unconditionally jump back to the beginning of the loop.
        
         | [deleted]
        
       | BoppreH wrote:
        | You can also use math to avoid most of the jumps:
        |
        |     int run_switches(char *input) {
        |         int res = 0;
        |         while (true) {
        |             char c = *input++;
        |             if (c == '\0') return res;
        |             // Here's the trick:
        |             res += (c == 's') - (c == 'p');
        |         }
        |     }
        |
        | This gives a 3.7x speedup compared to loop-1.c. The lower line
        | count is also nice.
        
         | svachalek wrote:
          | Nice. The way I read the cmov version, it's more or less this,
          | except the trick line goes
          |
          |     res += (c == 's') ? 1 : (c == 'p') ? -1 : 0;
          |
          | I haven't done C in decades so I don't trust myself to
          | performance test this, but I'm curious how it compares. Pretty
          | disappointed that TFA didn't go back and try that in C.
        
       | 414owen wrote:
       | A clickbait title for an in-depth look at hand-optimizing a very
       | simple loop.
        
         | ftxbro wrote:
          | I'm not a compiler expert, but if it's a "very simple loop",
          | is it still too complex for the compiler to make good machine
          | code? Did they use a bad compiler on purpose? Or are computers
          | just not yet fast enough to do a good job with very simple
          | loops in practical compilers?
        
           | twoodfin wrote:
           | This is the right answer:
           | 
           | https://news.ycombinator.com/item?id=36622584
           | 
           | Optimal assembly (forgoing SIMD, at least) for this loop on
           | modern x86 is highly dependent on the entropy of the runtime
           | data.
        
             | ftxbro wrote:
              | OK, so they were abusing the benchmark, in the sense that
              | the compiler's output would be faster on less contrived
              | test data? Do I have to look up what FDO, PGO, and cmov
              | are to understand the answer?
        
               | tylerhou wrote:
                | The compiler would generate different code if it knew the
                | rates at which branches were taken.
               | 
               | If a branch is almost always taken or almost never taken,
               | a compiler will want to emit a jump. The frontend will be
               | able to predict the jump with high probability, and a
               | successfully-predicted jump is "free." The cost of a
               | misprediction is paid for by the near-zero cost of the
               | many successful predictions.
               | 
               | If a branch is hard to predict (and taking versus not
               | taking it would load a different value into a
               | register/memory), the compiler wants to emit a
               | conditional move (cmov). A conditional move is slightly
               | "more expensive" in the backend because the CPU has to
               | wait for the condition to resolve before it can execute
               | instructions dependent on the output. However, it is much
                | cheaper than many mispredicted branches (which mispredict
                | around half of the time).
               | 
               | FDO (feedback-directed optimization) or PGO (profile-
               | guided optimization) means "run the code on some sample
               | input and profile how often branches are taken/not
               | taken." It gives the compiler more information to
               | generate better code.
               | 
               | The problem with the blog post is that the compiler has
               | no idea what the function's input data will look like. It
               | (arbitrarily) chose to generate branches instead of
               | cmovs. However, if the benchmark input is better suited
               | for cmovs, then the benchmark will (wrongly) show that
               | the compiler generates "slow" assembly. But that's not a
               | fair test, because with PGO/FDO the compiler would
               | generate equivalent assembly to the "fast" assembly
               | (actually, probably faster). Finally, the human (OP) is
               | using their knowledge of the benchmark data "unfairly" to
               | write better assembly than the compiler.
               | 
               | The takeaway is: most of the time, one can't optimize
               | code/assembly in a vacuum. You also need to know what the
               | input data and access patterns look like. FDO/PGO gives
               | the compiler more data to understand what the input
               | data/access patterns look like.
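                |
                | (Without a profile, you can also hand the compiler this
                | kind of information directly; a minimal sketch using the
                | GCC/Clang extension __builtin_expect, which encodes an
                | assumption about typical inputs rather than measured
                | data:)
                |
                |     /* Sketch: hint that the loop exit is rarely taken. */
                |     int run_switches_hinted(const char *input) {
                |         int res = 0;
                |         for (;;) {
                |             char c = *input++;
                |             if (__builtin_expect(c == '\0', 0))
                |                 return res;    /* end of string: rare */
                |             res += (c == 's') - (c == 'p');
                |         }
                |     }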
        
               | ftxbro wrote:
                | Thank you, this is an amazingly comprehensive answer! Now
                | I wonder what the workflow for using these compiler
                | features would be. Like, if I am a normal or bad C
                | programmer and I write my program, use valgrind to check
                | that it doesn't have obvious problems, and compile it
                | with -march=native or whatever, can I then add some step
                | to the workflow to somehow re-compile it in a way that
                | uses the branching or access patterns of some examples
                | that I let it process for that purpose?
        
               | tylerhou wrote:
                | Yes, Clang supports FDO, but it might be hard to set up
                | (I've never set up FDO myself). You could check out
                | https://github.com/google/autofdo and
                | https://clang.llvm.org/docs/UsersManual.html#profile-guided-....
               | 
               | (People within Google say "FDO", basically everyone else
               | says "PGO".)
        
           | cjensen wrote:
           | The problem is the author of the article is making some huge
           | implicit assumptions that the compiler can't possibly know
           | about.
           | 
           | Consider this statement: "However, we know some things about
           | this loop. We know that the only time we break out of it is
           | when we hit the null terminator ('\0'). The code clang
           | generates checks for the null terminator first, but this
           | makes no sense."
           | 
           | This statement contains huge assumptions about the lengths of
           | the input strings and the frequency of the letters 's' and
            | 'p' in the input. And then he has the chutzpah to describe
            | the compiler's failure to read his mind about this as
            | "making no sense."
           | 
            | Good first effort by the author, but he has not sufficiently
            | thought through the problem.
        
           | moonchild wrote:
           | > are computers just not yet fast enough to do a good job
           | with very simple loops in practical compilers?
           | 
           | The short answer to this question is 'yes', but there are
           | some extenuating factors:
           | 
            | - Although we could do interesting things with unlimited
            | computational resources, the current crop of C compilers is
            | simply not very good compared with what's possible today.
           | 
           | - Performance is always workload-dependent; the compiler has
           | been somewhat shafted here because it doesn't know what sorts
           | of inputs the function usually receives. The compiler output
           | is better than the 'improved' code for some inputs. (It's
           | possible you could get a better result from the existing
           | compilers and c code just by using profile-guided
           | optimisation.)
           | 
            | - The difference tends to be more pronounced in simple
            | loops than large ones. This is a contrived use-case. There is
            | not a factor of 6 of performance hiding in optimised C code
            | which could be recovered by doing the sorts of optimisations
            | done by the OP. Probably something more like 10-20%.
        
       | Const-me wrote:
       | I'm probably an optimization expert, and I would solve that
       | problem completely differently.
       | 
        | On my computer, the initial C version runs at 389 MB/second. I
        | haven't tested the assembly versions, but if they deliver the
        | same 6.2x speedup, that would work out to 2.4 GB/second here.
        |
        | Here's a C++ version which for long buffers exceeds 24 GB/second
        | on my computer:
        | https://gist.github.com/Const-me/3ade77faad47f0fbb0538965ae7...
        | That's a 61x speedup compared to the original version, without
        | any assembly, based on AVX2 intrinsics.
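        |
        | (The core idea looks roughly like this; a minimal sketch, not
        | the gist itself, assuming the length is known up front:)
        |
        |     /* Sketch: compare 32 bytes at a time against 's' and 'p',
        |      * then popcount the resulting match masks. */
        |     #include <immintrin.h>
        |     #include <stddef.h>
        |
        |     int run_switches_avx2(const char *buf, size_t len) {
        |         const __m256i vs = _mm256_set1_epi8('s');
        |         const __m256i vp = _mm256_set1_epi8('p');
        |         int res = 0;
        |         size_t i = 0;
        |         for (; i + 32 <= len; i += 32) {
        |             __m256i v = _mm256_loadu_si256((const __m256i *)(buf + i));
        |             res += __builtin_popcount(
        |                 (unsigned)_mm256_movemask_epi8(_mm256_cmpeq_epi8(v, vs)));
        |             res -= __builtin_popcount(
        |                 (unsigned)_mm256_movemask_epi8(_mm256_cmpeq_epi8(v, vp)));
        |         }
        |         for (; i < len; ++i)   /* scalar tail */
        |             res += (buf[i] == 's') - (buf[i] == 'p');
        |         return res;
        |     }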
        
         | gavinray wrote:
         | Do you know if this is possible using "std::experimental::simd"
         | out of curiosity?
         | 
         | https://en.cppreference.com/w/cpp/experimental/simd
        
       | lukas099 wrote:
       | Would it be possible to write a code profiler and compiler that
       | work together to optimize code based on real-world data? The
       | profiler would output data that would feed back into the
       | compiler, telling it which branches were selected most often,
       | which would recompile optimizing for the profile. Would this even
       | work? Has it already been done?
        
       | failuser wrote:
        | Having full-blown predication support is nice, but it interferes
        | with compact instruction encoding.
        |
        | A bloated ISA like x86 might actually be able to accommodate
        | predicate support, but who would try such a radical change?
        
       | vardump wrote:
        | I think it's straightforward to optimize this to the point where
        | it's maybe about 10x faster than the "optimized" version. The
        | answer is of course SIMD vectorization.
        
       | nwallin wrote:
       | IMHO the original code wasn't written in a way that's
       | particularly friendly to compilers. If you write it like this:
        |
        |     int run_switches_branchless(const char* s) {
        |         int result = 0;
        |         for (; *s; ++s) {
        |             result += *s == 's';
        |             result -= *s == 'p';
        |         }
        |         return result;
        |     }
       | 
       | ...the compiler will do all the branchless sete/cmov stuff as it
       | sees fit. It will be the same speed as the optimized assembly in
       | the post, +/- something insignificant. However it won't unroll
       | and vectorize the loop. If you write it like this:
        |
        |     int run_switches_vectorized(const char* s, size_t size) {
        |         int result = 0;
        |         for (; size--; ++s) {
        |             result += *s == 's';
        |             result -= *s == 'p';
        |         }
        |         return result;
        |     }
       | 
       | It will know the size of the loop, and will unroll it and use
       | AVX-512 instructions if they're available. This will be
       | substantially faster than the first loop for large inputs,
       | although I'm too lazy to benchmark just how much faster it is.
       | 
       | Now, this requires knowing the size of your string in advance,
       | and maybe you're the sort of C programmer who doesn't keep track
       | of how big your strings are. I'm not your coworker, I don't
        | review your code. Do what you want. But you really _really_
        | probably shouldn't.
       | 
       | https://godbolt.org/z/rde51zMd8
        
         | jonny_eh wrote:
         | > But you really really probably shouldn't.
         | 
         | Shouldn't "not" keep track of string length?
        
           | darig wrote:
           | [dead]
        
           | nwallin wrote:
           | Err... yes. You shouldn't not keep track of string/buffer
           | sizes.
        
         | 414owen wrote:
         | The version that's friendly to the compiler is described in
         | part two: https://owen.cafe/posts/the-same-speed-as-c/
         | 
          | It achieves 3.88 GiB/s.
         | 
         | I intentionally didn't go down the route of vectorizing. I
         | wanted to keep the scope of the problem small, and show off the
         | assembly tips and tricks in the post, but maybe there's
         | potential for a future post, where I pad the input string and
         | vectorize the algorithm :)
        
       | aidenn0 wrote:
       | A while back, I wrote a UTF-8 decoder in Common Lisp, targeting
       | SBCL (it already has one built in, this was an exercise). Pretty
       | much all of the optimization win (after the obvious low-hanging
       | fruit) was structuring the code so that the compiler would
       | generate cmov* instructions rather than branches.
        
         | moonchild wrote:
          | Branches tend to be faster than conditional moves if they
          | are correctly predicted, because they do not increase the
          | critical path length. And UTF-8 decoders are commonly run on
          | all-ASCII input. What were you benchmarking on?
        
           | aidenn0 wrote:
            | I ran separate benchmarks on all-ASCII, BMP-only, and ASCII
            | with non-BMP. ASCII was _not_ slower on the low-branch
            | version.
        
         | whartung wrote:
          | What are some examples of the code changes that you made? And
          | did you just do repeated disassemblies of the functions to see
          | that it was using the correct instructions, or did you do some
          | benchmarking to show your changes were actual improvements?
        
           | aidenn0 wrote:
            | Gosh, I'd have to see if I can dig it up; this was a few
            | years ago.
           | 
           | I did all of the above, plus profiling (sb-sprof combined
           | with disassemble will show assembly level profiling).
        
       | Sesse__ wrote:
       | This code screams for SIMD! If you can change the prototype to
       | take an explicit length, you could easily read and process 16
       | bytes at a time (the compares will give you values you can just
       | add and subtract directly). Heck, even calling strlen() at the
       | function's start to get the explicit length would probably be
       | worth it.
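        |
        | (A minimal sketch of that idea using SSE2 intrinsics; my reading
        | of the approach, not tested code from this thread. pcmpeqb yields
        | 0 or -1 per byte, so subtracting the 's' mask bumps a per-byte
        | counter, which psadbw then folds into a wide sum before the byte
        | counters can overflow:)
        |
        |     #include <emmintrin.h>
        |     #include <stddef.h>
        |
        |     int run_switches_sse2(const char *buf, size_t len) {
        |         const __m128i vs = _mm_set1_epi8('s');
        |         const __m128i vp = _mm_set1_epi8('p');
        |         const __m128i zero = _mm_setzero_si128();
        |         size_t i = 0;
        |         int res = 0;
        |         while (i + 16 <= len) {
        |             __m128i accs = zero, accp = zero;
        |             int block = 0;
        |             /* <= 255 blocks so the unsigned byte counters can't wrap */
        |             for (; block < 255 && i + 16 <= len; ++block, i += 16) {
        |                 __m128i v = _mm_loadu_si128((const __m128i *)(buf + i));
        |                 accs = _mm_sub_epi8(accs, _mm_cmpeq_epi8(v, vs));
        |                 accp = _mm_sub_epi8(accp, _mm_cmpeq_epi8(v, vp));
        |             }
        |             /* psadbw: sum 16 byte counters into two 16-bit totals */
        |             __m128i ss = _mm_sad_epu8(accs, zero);
        |             __m128i sp = _mm_sad_epu8(accp, zero);
        |             res += _mm_extract_epi16(ss, 0) + _mm_extract_epi16(ss, 4);
        |             res -= _mm_extract_epi16(sp, 0) + _mm_extract_epi16(sp, 4);
        |         }
        |         for (; i < len; ++i)   /* scalar tail */
        |             res += (buf[i] == 's') - (buf[i] == 'p');
        |         return res;
        |     }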
        
       | camel-cdr wrote:
        | I threw together a quick RISC-V vectorized implementation:
        |
        |     size_t run(char *str) {
        |         uint8_t *p = (uint8_t *)str;
        |         long end = 0;
        |         size_t res = 0, vl;
        |         while (1) {
        |             vl = __riscv_vsetvlmax_e8m8();
        |             vuint8m8_t v = __riscv_vle8ff_v_u8m8(p, &vl, vl);
        |             end = __riscv_vfirst_m_b1(
        |                 __riscv_vmseq_vx_u8m8_b1(v, '\0', vl), vl);
        |             if (end >= 0)
        |                 break;
        |             res += __riscv_vcpop_m_b1(__riscv_vmseq_vx_u8m8_b1(v, 's', vl), vl);
        |             res -= __riscv_vcpop_m_b1(__riscv_vmseq_vx_u8m8_b1(v, 'p', vl), vl);
        |             p += vl;
        |         }
        |         vl = __riscv_vsetvl_e8m8(end);
        |         vuint8m8_t v = __riscv_vle8_v_u8m8(p, vl);
        |         res += __riscv_vcpop_m_b1(__riscv_vmseq_vx_u8m8_b1(v, 's', vl), vl);
        |         res -= __riscv_vcpop_m_b1(__riscv_vmseq_vx_u8m8_b1(v, 'p', vl), vl);
        |         return res;
        |     }
       | 
        | Here are the results for the above, the switch, and the table C
        | implementations, run on my MangoPi MQ-Pro (C906, in-order rv64gc
        | with rvv 0.7.1, and a 128-bit vector length):
        |
        |     switch: 0.19 Bytes/Cycle
        |     tbl:    0.17 Bytes/Cycle
        |     rvv:    1.57 Bytes/Cycle (dips down to 1.35 after ~30 KiB)
       | 
        | Edit: you can go up to 2/1.7 Bytes/Cycle if you make sure the
        | pointer is page-aligned (and vl isn't larger than the page
        | size); see comments.
        
         | dzaima wrote:
         | To be fully correct, you'd need the load to be a fault-only-
         | first load (which rvv does have), otherwise that could fail if
         | the null byte was just before the end of allocated memory.
        
           | camel-cdr wrote:
            | I'm not sure I fully understand fault-only-first loads, but
            | reading the description of vle8ff.v, I think I only need to
            | swap out the load inside the loop?
           | 
           | How does the normal load deal with faults?
           | 
            | I'll update the parent comment; the change slowed things down
            | from 2/1.7 to 1.57/1.36 Bytes/Cycle.
        
             | dzaima wrote:
             | You'd probably want to have a new __riscv_vsetvlmax_e8m8 at
             | the start of each loop iteration, as otherwise an earlier
             | iteration could cut off the vl (e.g. page unloaded by the
             | OS), and thus the loop continues with the truncated vl.
             | 
             | The normal load should just segfault if any loaded byte is
             | outside of readable memory, same as with a scalar load
             | which is similarly partly outside.
        
               | camel-cdr wrote:
               | > You'd probably want to have a new
               | __riscv_vsetvlmax_e8m8 at the start of each loop
               | iteration, as otherwise an earlier iteration could cut
               | off the vl (e.g. page unloaded by the OS), and thus the
               | loop continues with the truncated vl.
               | 
               | Oh, yeah, that was a big oversight, unfortunately, this
               | didn't undo the performance regression.
               | 
               | > The normal load should just segfault if any loaded byte
               | is outside of readable memory, same as with a scalar load
               | which is similarly partly outside.
               | 
               | I don't quite understand how that plays out.
               | 
               | The reference memcpy implementation uses `vle8.v` and the
               | reference strlen implementation uses `vle8ff.v`.
               | 
               | I think I understand how it works in strlen, but why does
               | memcpy work without the ff? Does it just skip the
               | instruction, or repeat it? Because in either case,
               | shouldn't `vle8.v` work with strlen as well? There must
               | be another option, but I can't think of any.
               | 
               | Also, does this mean I can get the original performance
               | back, if I make sure to page align my pointers and use
               | `vle8.v`?
        
               | dzaima wrote:
               | The memcpy doesn't use a vlmax, it uses a hand-chosen vl.
               | The load won't fault on any elements not loaded (here,
               | elements past the vl), so the memcpy is fine as it only
               | loads items it'll definitely need, whereas your original
               | code can read elements past the null byte.
               | 
               | And yeah, aligning the pointer manually would work
               | (though then it wouldn't be portable code, as the spec
               | does allow for rvv implementations with VLEN of up to
               | 65536 (8KB per register; 64KB with LMUL=8), which'll be
               | larger than the regular 4KB pages).
        
               | camel-cdr wrote:
               | Ah, this makes a lot more sense now. I thought the
               | "fault" was about the kernel interrupting when a new page
               | needs to be loaded into physical memory, which would also
               | happen for memcpy.
        
           | camel-cdr wrote:
           | I just found your rvv intrinsics-viewer [0], that'll be so
           | helpful.
           | 
            | I tried building one myself, but my miserable web skills
            | didn't allow me to lazily load the instructions, which made
            | it too slow for actual use.
           | 
           | Can I share your project on lemmy?
           | 
           | [0] https://dzaima.github.io/intrinsics-viewer
        
             | dzaima wrote:
              | Go ahead! I'm not much of a web dev either, but decided to
              | struggle through it to, mainly, just have better searching.
              | (Originally for Intel & ARM intrinsics, which are also
              | available if downloaded for offline use.)
        
       | eklitzke wrote:
       | Rearranging branches (and perhaps blocks too?) will definitely be
       | done if you are building using FDO, because without FDO (or PGO)
       | the compiler has no idea how likely each branch is to be taken.
       | Cmov can also be enabled by FDO in some cases.
       | 
       | However, whether or not using cmov is effective compared to a
       | regular test/jump is highly dependent on how predictable the
       | branch is, with cmov typically performing better when the branch
       | is very unpredictable. Since they got a 6x speedup with cmov, I
       | assume that their test input (which isn't described in the post,
       | and is also not in their GitHub repo) consists of random strings
       | consisting almost entirely of s and p characters. There's nothing
       | wrong with this, but it does make the post seem a little
       | misleading to me, as their clever speedup is mostly about
       | exploiting an unmentioned property of the data that is highly
       | specific to their benchmark.
        
         | 414owen wrote:
         | > because without FDO (or PGO) the compiler has no idea how
         | likely each branch is to be taken
         | 
          | So, the maximum number of times you can hit '\0' in the string
          | is once, because then the function returns, but you can hit the
          | other characters many times, which seems to be information a
          | compiler has access to without PGO.
         | 
         | PGO does help, of course, and on my machine gives me 2.80s,
         | which is better than the code at the end of the `Rearranging
         | blocks` section :)
         | 
         | > I assume that their test input (which isn't described in the
         | post, and is also not in their GitHub repo)
         | 
          | It's described under `Benchmarking setup`, and is in the
          | repository here:
          | https://github.com/414owen/blog-code/blob/master/01-six-time...
         | 
         | Side note: There's a part two to this post (linked at the
         | bottom) where I make the C code as fast as I possibly can, and
         | it beats all the assembly in this post.
         | 
         | I never said writing assembly is (necessarily) a good idea, I
         | just find optimizing it, and deciphering compiler output, an
         | interesting challenge, and a good learning opportunity.
        
         | nwallin wrote:
         | > I assume that their test input (which isn't described in the
         | post, and is also not in their GitHub repo) consists of random
         | strings consisting almost entirely of s and p characters.
         | 
          | The test code is here:
          | https://github.com/414owen/blog-code/blob/master/02-the-same...
          | It randomly selects between 's' and 'p'. The characters can't
          | be anything other than 's', 'p', or the terminating null.
          | Knowing that particular fact about our input gives us this
          | ...clever... optimization:
          |
          |     int run_switches(const char* s) {
          |         int result = 0;
          |         while (*s)
          |             result += (1 | *s++) - 'r';
          |         return result;
          |     }
         | 
          | which compiles to:
          |
          |     run_switches:
          |             movzx   eax, BYTE PTR [rdi]
          |             xor     edx, edx
          |             test    al, al
          |             je      .L1
          |     .L3:
          |             or      eax, 1
          |             inc     rdi
          |             movsx   eax, al
          |             lea     edx, [rdx-114+rax]
          |             movzx   eax, BYTE PTR [rdi]
          |             test    al, al
          |             jne     .L3
          |     .L1:
          |             mov     eax, edx
          |             ret
         | 
         | This is too clever by half, of course, but it perfectly
         | illustrates your point about exploiting properties of the data.
        
       ___________________________________________________________________
       (page generated 2023-07-06 23:00 UTC)