[HN Gopher] Show HN: 10-40% faster LZMA decoder using x86 CMOVcc
       ___________________________________________________________________
        
       Show HN: 10-40% faster LZMA decoder using x86 CMOVcc
        
       Author : jpegqs
       Score  : 149 points
       Date   : 2021-12-13 15:29 UTC (7 hours ago)
        
 (HTM) web link (gist.github.com)
 (TXT) w3m dump (gist.github.com)
        
       | hasmanean wrote:
       | When did CMOVs become efficient again? The x86 optimization guide
       | used to warn people not to use them, around 2008.
        
         | ncmncm wrote:
         | Cmov is, and was, efficient whenever branch prediction is
         | unreliable.
         | 
         | Clang will turn eligible ?: expressions into cmov. Gcc will do
         | no more than one of those per basic block, subject to fragile
         | conditions. Gcc got its fingers badly burned by overuse of
         | cmov. [Edit: I am wrong! Gcc would not do it when I was trying.
         | More research needed.]
         | 
          | Generally, branches break dependency chains, enabling more
          | implicit speculative parallelism. Older cores stuck cmov ops
          | onto dependency chains, whereas more recently they are treated
          | more like branches. That gives you less speculative evaluation,
          | but consumes less state that would be discarded on a
          | mispredicted branch.
        
         | 10000truths wrote:
         | https://www.agner.org/optimize/optimizing_assembly.pdf#page=...
         | 
          | TL;DR: the rule of thumb is that a conditional jump is better
          | than CMOV if the code is part of a dependency chain and the
          | prediction rate is better than 75%.
        
         | wolf550e wrote:
         | CMOVs were terrible on Pentium 4 but ok on PPro, Pentium M,
         | Core, Core 2, Sandybridge, Skylake, etc.
         | 
         | Also ok on Atom and on AMD chips.
        
       | jpegqs wrote:
        | LZMA compression (used in 7-zip, XZ, LZMA2) is known as one of
        | the best, but it has a noticeable drawback - it's slow. I tried
        | to improve decompression speed by removing excessive branching
        | when decoding every bit of the compressed data.
        | 
        | The decompression speedup from this patch largely depends on the
        | compression ratio: the higher the ratio, the smaller the speedup.
        | Compressed text, such as source code, shows the least
        | improvement.
        | 
        | The results shown are from my Skylake, compiled with GCC. Please
        | help me with testing on different x86 CPUs.
        | 
        | x86 (32-bit) should work, but I haven't tested it yet. Compiling
        | with Clang should work as well.
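        | 
        | Roughly, the idea in plain C looks like this (an illustrative
        | sketch only, not the code from the patch - names are simplified,
        | and range normalization and the probability update are omitted):
        | 
        |     #include <stdint.h>
        | 
        |     /* Decode one bit of the binary range coder without a
        |      * conditional branch: the comparison result becomes a
        |      * 0 / ~0 mask that selects the new range and code. */
        |     static inline unsigned decode_bit(uint32_t *range,
        |                                       uint32_t *code,
        |                                       uint32_t prob)
        |     {
        |         uint32_t bound = (*range >> 11) * prob;          /* split point  */
        |         uint32_t mask  = 0u - (uint32_t)(*code < bound); /* ~0 if bit==0 */
        | 
        |         *range = (bound & mask) | ((*range - bound) & ~mask);
        |         *code -= bound & ~mask;      /* subtract only when bit==1 */
        | 
        |         return mask + 1u;            /* decoded bit: 0 or 1 */
        |     }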
        
         | jack_pp wrote:
          | I recently watched this talk that may help you profile
          | better: https://youtu.be/r-TLSBdHe1A
        
       | viraptor wrote:
        | When I saw CMOVcc, my mind jumped to the callcc name, and now I
        | can't stop thinking about what absurd way you could create an
        | actual CMOV-with-current-continuation. Conditional-MOV-value-or-
        | EIP-otherwise seems too trivial.
        
       | injinj wrote:
        | Decent speedup on my 3970x. Including this patch reduced the
        | number of instructions in lzma_decoder.s (using gcc 8.3.1) by
        | about 8% (4417 lines of asm vs 4824). With perf stat, an
        | astonishing branch-miss reduction, from about 409M to 105M.
       | 
        | Using the firefox example from the gist:
        | 
        |     $ tar -cJf lib.tar.xz /usr/lib64/firefox
        | 
        | The xz shipped with the system:
        | 
        |     $ perf stat xz -c -d lib.tar.xz > /dev/null
        | 
        |      Performance counter stats for 'xz -c -d lib.tar.xz':
        | 
        |           4,650.32 msec task-clock:u              #    1.000 CPUs utilized
        |                  0      context-switches:u        #    0.000 K/sec
        |                  0      cpu-migrations:u          #    0.000 K/sec
        |                591      page-faults:u             #    0.127 K/sec
        |     19,849,912,300      cycles:u                  #    4.269 GHz                      (83.33%)
        |        425,290,878      stalled-cycles-frontend:u #    2.14% frontend cycles idle     (83.33%)
        |      1,831,640,390      stalled-cycles-backend:u  #    9.23% backend cycles idle      (83.34%)
        |     23,973,036,103      instructions:u            #    1.21  insn per cycle
        |                                                   #    0.08  stalled cycles per insn  (83.33%)
        |      2,939,144,233      branches:u                #  632.031 M/sec                    (83.34%)
        |        409,371,860      branch-misses:u           #   13.93% of all branches          (83.33%)
        | 
        |        4.650679926 seconds time elapsed
        | 
        |        4.611657000 seconds user
        |        0.011931000 seconds sys
        | 
        | The patched xz:
        | 
        |     $ git clone http://git.tukaani.org/xz.git
        |     $ cd xz/src
        |     $ patch -l -p1 < ../faster_lxma_decoder_x86.patch
        |     $ cd .. ; autogen.sh && configure && make
        |     $ LD_PRELOAD=./liblzma/.libs/liblzma.so
        |     $ perf stat ./xz/.libs/xz -c -d ../../lib.tar.xz > /dev/null
        | 
        |      Performance counter stats for './xz/.libs/xz -c -d ../../lib.tar.xz':
        | 
        |           3,578.54 msec task-clock:u              #    1.000 CPUs utilized
        |                  0      context-switches:u        #    0.000 K/sec
        |                  0      cpu-migrations:u          #    0.000 K/sec
        |                593      page-faults:u             #    0.166 K/sec
        |     15,186,685,715      cycles:u                  #    4.244 GHz                      (83.32%)
        |        108,663,507      stalled-cycles-frontend:u #    0.72% frontend cycles idle     (83.32%)
        |      8,753,057,119      stalled-cycles-backend:u  #   57.64% backend cycles idle      (83.34%)
        |     27,322,182,837      instructions:u            #    1.80  insn per cycle
        |                                                   #    0.32  stalled cycles per insn  (83.35%)
        |      1,979,944,734      branches:u                #  553.282 M/sec                    (83.34%)
        |        104,752,154      branch-misses:u           #    5.29% of all branches          (83.34%)
        | 
        |        3.578973194 seconds time elapsed
        | 
        |        3.549329000 seconds user
        |        0.011942000 seconds sys
        
       | CyberShadow wrote:
       | Would using compiler intrinsics instead of raw assembler have any
       | benefits here, such as being applicable to more architectures?
        
         | dragontamer wrote:
          | Well, ideally, compilers should be making these CMOV-vs-branch
          | decisions much better.
          | 
          | I'm shocked, but not that shocked
          | (https://tenor.com/view/shocked-gif-5787388), that compilers
          | today are still weaker than a dedicated raw-assembly programmer
          | at these kinds of microarchitectural decisions - even for what
          | should be ordinary 64-bit code, with compilers that have very
          | good models of per-instruction throughput and latency.
          | 
          | ---------
          | 
          | I'm now more curious about which compilers turn the raw C code
          | into a cmov and which turn it into the less efficient form
          | (I assume branching?).
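          | 
          | (A quick way to check is to feed a minimal ternary to each
          | compiler on godbolt - a made-up test function, not code from
          | the patch:)
          | 
          |     #include <stdint.h>
          | 
          |     /* Does the compiler emit cmov (or csel on arm64) here,
          |        or a compare-and-branch? */
          |     uint32_t clamp_range(uint32_t code, uint32_t bound,
          |                          uint32_t range)
          |     {
          |         return code < bound ? bound : range;
          |     }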
        
           | CyberShadow wrote:
            | Using CMOV vs. branching seems like a strategic decision
            | that it makes sense to leave to the programmer, as there is
            | a trade-off between always calculating a value which may be
            | unused and the cost of the conditional branch.
        
             | danachow wrote:
             | > as there is a trade-off of always calculating a value
             | which may be unused vs. the cost of the conditional branch.
             | 
             | I don't really understand the point you're trying to make.
             | Figuring out if a value is unused is most definitely the
             | purview of an optimizer. Also, the "calculating a value"
             | isn't really the trade off being made between cmov and
             | branching.
        
               | CyberShadow wrote:
               | > I don't really understand the point you're trying to
               | make. Figuring out if a value is unused is most
               | definitely the purview of an optimizer.
               | 
               | If the conditional move doesn't happen, then the source
               | (insofar as the move is concerned) is unused.
               | 
                | Consider this pseudocode:
                | 
                |     int value = some_nontrivial_function_with_no_side_effects();
                |     if (condition)
                |         *target = value;
               | 
               | Note that the function can be as simple as a memory read.
               | 
               | The compiler could compile this in two ways:
               | 
               | 1. Observing that the function's result is used only if
               | condition is true, move the function call inside the if
               | block.
               | 
               | 2. Always call the function, as in the source code, but
               | compile the if block to a conditional move.
               | 
               | In such situations, it would make sense to allow
               | programmers to indicate the desired strategy to the
               | compiler.
               | 
               | I suppose CPUs might elide calculating the value even
               | with a conditional move if they can predict the condition
               | is [likely to be] false; I don't know how true that is in
               | practice.
               | 
               | > Also, the "calculating a value" isn't really the trade
               | off being made between cmov and branching.
               | 
               | Depending on the situation and interpretation of terms, I
               | also agree.
        
           | brigade wrote:
           | gcc and clang see the use of ternaries as a hint that they
           | should use cmov, but it's obviously not guaranteed.
        
             | dragontamer wrote:
             | I'm trying to grok the performance improvement here.
             | 
              | The diff points to:
              | 
              |     + /* tmp = rc.code < rc_bound ? rc.range = rc_bound, ~0 : 0; */ \
              |     + __asm__ ( \
              |     +  "cmpl %3, %2\n\t" \
              |     +  "cmovbl %3, %1\n\t" \
              |     +  "sbbl %0, %0" \
              |     +  : "=&r"(tmp), "+&r"(rc.range) \
              |     +  : "r"(rc.code), "r"(rc_bound) \
              |     + ); \
             | 
              | This is the only use of assembly in the entire diff. That
              | "tmp = rc.code < rc_bound ? ..." statement looks like it
              | would probably compile to a cmov on its own; at least I'd
              | expect it to, without much issue.
              | 
              | However, there is some advanced carry-flag manipulation
              | going on here. I'm not sure it's the cmov per se that is
              | advanced, so much as the sbbl instruction (subtract with
              | borrow, the subtraction analog of adc). sub %0, %0 is
              | obviously zero, but sbbl %0, %0 is 0xFFFFFFFF if the carry
              | flag is set.
              | 
              | It's certainly a very well thought out bit of manual
              | assembly. The sbb is probably more important than the cmov
              | (in that the compiler probably emits the cmov, but may not
              | see the sbb).
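              | 
              | (A rough C analogue of the cmp / cmovbl / sbbl sequence -
              | illustrative only, with the diff's rc.code / rc_bound /
              | rc.range spelled as plain variables:)
              | 
              |     #include <stdint.h>
              | 
              |     /* Returns the ~0 / 0 mask (the sbb result) and
              |        conditionally replaces the range (the cmov). */
              |     static inline uint32_t sbb_mask(uint32_t rc_code, uint32_t rc_bound,
              |                                     uint32_t *rc_range)
              |     {
              |         uint32_t below = (uint32_t)(rc_code < rc_bound); /* carry from cmpl */
              |         *rc_range = below ? rc_bound : *rc_range;        /* cmovbl          */
              |         return 0u - below;                               /* sbbl %0,%0      */
              |     }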
             | 
             | ----------
             | 
              | The original code seems to be: https://github.com/COMBINE-lab/xz/blob/master/src/liblzma/ra...
              | 
              |     #define rc_direct(dest, seq) \
              |     do { \
              |         rc_normalize(seq); \
              |         rc.range >>= 1; \
              |         rc.code -= rc.range; \
              |         rc_bound = UINT32_C(0) - (rc.code >> 31); \
              |         rc.code += rc.range & rc_bound; \
              |         dest = (dest << 1) + (rc_bound + 1); \
              |     } while (0)
              | 
              | I don't know exactly what it's doing, but the cmov version
              | is entirely different. There doesn't seem to be a branch
              | involved at all in this rc_direct inner loop.
              | 
              | It's not very clear to me how they looked at this sequence
              | of C and then decided on a cmov / sbb approach to do the
              | equivalent work. It's clearly the kind of advanced thinking
              | that no compiler would have come up with.
        
               | brigade wrote:
                | It's replacing rc_bit, not rc_direct (rc_direct is a
                | fixed 0.5 probability). rc_bit just uses rc_bit_last, so
                | the original code is:
                | 
                |     #define rc_bit_last(prob, action0, action1, seq) \
                |     do { \
                |         rc_if_0(prob, seq) { \
                |             rc_update_0(prob); \
                |             action0; \
                |         } else { \
                |             rc_update_1(prob); \
                |             action1; \
                |         } \
                |     } while (0)
                | 
                | So it's merging the two sides of rc_update, and the
                | cmov/sbb handle the difference. action0/action1 are
                | generally blank, but rc_bit_matched makes the common
                | action branchless as well.
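                | 
                | For instance, once the comparison result is available as
                | a 0 / ~0 mask (the sbb result), the two update macros can
                | be merged branchlessly - a sketch, not the literal patch,
                | using liblzma's RC_BIT_MODEL_TOTAL / RC_MOVE_BITS values:
                | 
                |     #include <stdint.h>
                | 
                |     #define RC_BIT_MODEL_TOTAL (1u << 11) /* as in range_common.h */
                |     #define RC_MOVE_BITS       5
                | 
                |     /* Compute both rc_update_0 and rc_update_1, then let
                |        the mask pick one - no branch needed. */
                |     static inline uint32_t update_prob(uint32_t prob, uint32_t mask)
                |     {
                |         uint32_t p0 = prob + ((RC_BIT_MODEL_TOTAL - prob) >> RC_MOVE_BITS);
                |         uint32_t p1 = prob - (prob >> RC_MOVE_BITS);
                |         return (p0 & mask) | (p1 & ~mask);
                |     }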
        
               | dragontamer wrote:
               | Thanks for the info.
               | 
                | Realizing that the choice between the two sides of the
                | "if" statement is really one bit of information, storing
                | that bit in the carry flag, and then using cmovcc / sbbl
                | instructions to manipulate it is pretty advanced.
                | 
                | I could imagine a compiler making use of all that
                | information, but maybe only with some programmer
                | assistance (i.e., code written in a form that is easier
                | to optimize). Or, you know, an assembly programmer who
                | just writes that kind of assembly explicitly.
        
               | brigade wrote:
               | Yeah, this sort of conversion is pretty common for the
               | core of arithmetic/range decoders, and I definitely don't
               | expect a compiler to be capable of the sort of
               | transformations needed to convert the branchy form to
               | branchless.
               | 
                | Though I'd expect compilers to be at least 90% as good
                | with the ternary equivalent of the asm, e.g.:
                | 
                |     tmp   = code < bound ? ~0 : 0;
                |     range = code < bound ? bound : range;
                | 
                | which compiles tmp to xor -> setl -> neg (clang) or
                | setge -> movzx -> sub (gcc) instead of sbb. But on arm64
                | they use just one csetm (clang) or csinv (gcc), which is
                | basically optimal, and in all cases the comparison is
                | reused for the cmov.
               | 
               | https://godbolt.org/z/jPbo3qc1Y
        
           | ncmncm wrote:
           | Correctly-predicted branching is better; but if the branch
           | would have been mis-predicted, cmov is better. The compiler
           | has no idea which will happen at runtime.
           | 
           | Gcc will never under any circumstances generate two cmov
           | instructions in a basic block, even when you definitely want
           | that. Clang will. [Edit: I am wrong. Gcc can do more than one
           | cmov in a basic block. Just not when I was trying it!]
           | 
           | In principle, profile-guided optimization could help, if the
           | instrumented run is representative of production behavior. In
           | my tests it did not affect the cmov/branch choice. Neither
           | did the "% expected" intrinsic. Either of those could change
           | in any release.
           | 
           | The Zstd and Lz4 decoders make very effective use of cmov.
           | They should be your model.
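            | 
            | (Assuming the intrinsic referred to is GCC's
            | __builtin_expect_with_probability, available since GCC 9, a
            | usage sketch with a made-up function looks like:)
            | 
            |     /* Hint that cond is true about half the time; whether this
            |        sways the cmov/branch decision is up to the compiler. */
            |     int pick(int cond, int a, int b)
            |     {
            |         if (__builtin_expect_with_probability(cond, 1, 0.5))
            |             return a;
            |         return b;
            |     }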
        
             | mmozeiko wrote:
             | Here are two cmov's with gcc:
             | https://godbolt.org/z/j9bcv639c
        
               | ncmncm wrote:
                | Thank you. It appears Gcc won't put two cmov _writes to
                | memory_ in a block. Thus,
                | 
                |     void swap_if(bool c, int& a, int& b) {
                |         int ta = a, tb = b;
                |         a = c ? tb : ta;
                |         b = c ? ta : tb;
                |     }
                | 
                | is very slow under Gcc when c is poorly predicted, as is
                | typical when e.g. partitioning for quicksort. But how well
                | it will be predicted depends on the input data.
               | 
               | [0] https://godbolt.org/z/j5W9dMjYE
        
               | mmozeiko wrote:
                | To me it looks like it's related to some other
                | optimization pass (I don't know much about gcc passes),
                | not to writes to memory. Here are two writes both using
                | cmov (on different code):
                | https://godbolt.org/z/n3dTrPo6e
               | 
               | Edit: compiling your code without modifications, but with
               | `-Os` also gives two cmov's:
               | https://godbolt.org/z/r86azb7be
        
               | tssva wrote:
                | I'm not a programmer by profession and barely one by
                | hobby, and I know nothing about compiler internals. But
                | looking at the definition of a basic block from
                | Wikipedia - "In compiler construction, a basic block is a
                | straight-line code sequence with no branches in except to
                | the entry and no branches out except at the exit" - it
                | seems that the ternary operators in "return (x ? a : b) +
                | (y ? c : d);", although part of the same programming
                | block, would be distinct compiler basic blocks.
        
               | dragontamer wrote:
               | Basic blocks are how compilers "conceptualize" our code
               | and start thinking about optimization.
               | 
                | Frankly, my recommendation to you is to ignore that
                | stuff. You need like 3 or 4 years of school to reach that
                | level. You should have studied easier graph optimization
                | problems (ex: Finite Automata) first, before dealing with
                | code optimization.
               | 
               | The school curriculum is typically Programming 101 ->
               | Data Structures -> Algorithms -> Finite Automata (aka:
               | Regex) -> Pushdown Automata (aka: context-free grammars)
               | -> Turing Machines (aka: general purpose code) ->
               | Compilers (which use Finite Automata, Pushdown Automata,
               | and Turing Machines simultaneously).
               | 
                | Basic Blocks are the graph structure that compilers use
                | to think about code. It's a lot of graph theory built on
                | top of a lot of language theory.
               | 
               | ------
               | 
               | That being said: if you're actually interested in this
               | stuff, then feel free to explore. But just beware, you've
               | stumbled upon a very complex subject that is easily 4th
               | year undergrad or even graduate-school level.
               | 
               | With enough effort, you'll understand things. But this is
               | no subject for beginners to go around exploring by
               | themselves.
               | 
               | -------
               | 
                | That being said, I want to encourage you to explore the
                | subject anyway - just go in with the awareness that
                | there's a lot to understand here.
               | 
                | If you want to skip the 3ish years of elementary data-
                | structures / comp. sci theory... I suggest starting with
                | Static Single Assignment and working forward from there.
                | 
                | https://en.wikipedia.org/wiki/Static_single_assignment_form
        
       | jeffbee wrote:
        | If you're going for decoding speed, wouldn't you be using brotli
        | anyway? At a given compression ratio, brotli is ~4x faster on the
        | decode path, and they are similar on the encode path.
       | 
       | Your corpus may vary etc etc.
        
         | orev wrote:
         | Contrary to popular belief (sadly, even among the tech crowd
         | here), people use computers for more than just web stuff.
        
           | skyde wrote:
           | Brotli is not just for Web stuff.
        
             | duskwuff wrote:
             | Brotli is heavily tuned for "web stuff". It's literally got
             | a built-in dictionary full of commonly used HTML fragments:
             | 
             | https://gist.github.com/klauspost/2900d5ba6f9b65d69c8e
        
               | sebazzz wrote:
               | Shoot me, works well for the average desktop (=Electron)
               | application then.
        
               | adgjlsfhk1 wrote:
               | Electron is just web stuff.
        
             | orev wrote:
              | Sure. In theory. But every other response to this comment
              | is clearly talking about web stuff, and in practice brotli
              | is really only used for web stuff.
        
               | jeffbee wrote:
               | I don't know what's theoretical about it. There are
               | Debian packages for brotli and it just behaves like a
               | stdio codec, same as the `xz` command.
        
               | masklinn wrote:
               | Brotli has a static dictionary heavily geared towards web
               | formats, and there really is no reason to use brotli
               | outside of web stuff: the point of brotli is that it has
               | browser support and does 15~20% better than gzip.
               | 
                | It's not bad per se, but if you're not constrained to the
                | web, zstd will have a better compression ratio at every
                | throughput (and better throughput at every compression
                | ratio), and if you want maximum compression, period, it's
                | not close to competing with LZMA-based formats (xz,
                | lzip).
        
               | skyde wrote:
                | I think he means it is less popular (like Zstandard and
                | LZ4) compared to zlib, LZMA, and bzip2. This is
                | unfortunate, because Brotli and Zstandard compress much
                | better than zlib and LZ4, and decompress much faster than
                | zlib, LZMA, and LZ4.
        
           | 1vuio0pswjnm7 wrote:
           | IIRC, in a recent thread about zstd, someone claimed brotli
           | was faster. Perhaps one needs to look at whether the sample
           | being compressed is web stuff.
        
             | jeffbee wrote:
             | If you're too busy to check yourself, the "squash
             | compression benchmark" page has results for dozens of
             | codecs and different inputs.
        
         | pletnes wrote:
         | Linux distros often ship xz out of the box. I've never seen
         | brotli in the wild. YMMV of course - I am sure there are
         | exceptions.
         | 
         | I've used brotli myself, but only as it's included in some
         | parquet read/write libs.
        
           | peter-m80 wrote:
           | cloudflare uses brotli (not sure if optional or by default)
        
           | stillicidious wrote:
            | Probably 10-40% of what your browser is served nowadays is
            | brotli; it's in Chrome, so all the major CDNs support it by
            | default.
        
             | pletnes wrote:
             | I'm sure you're right, but files-on-disk is another story.
             | I was referring to the latter.
        
               | ByTheBook wrote:
                | JPEG XL makes use of Brotli, and once it hits 1.0 it will
                | most likely be turned on by default in browsers, so you
                | will probably see a lot of jxl in the wild - particularly
                | since it provides lossless recompression of existing JPEG
                | files with ~20% smaller size as a result.
        
           | throwaway1777 wrote:
           | It's definitely in use by Meta and Google.
        
             | GordonS wrote:
             | I'm not the OP, but I took their comment to mean "on the
             | command line".
        
         | dragontamer wrote:
         | But this seems like a patch given to the LZMA project, with an
         | interesting low-level result to boot.
        
         | [deleted]
        
         | rwmj wrote:
         | I don't know about web use, but xz's major advantage for me is
         | the seekable format. We build disk image pipelines based on
         | remote xz-compressed images where only the parts of the file
         | being accessed need to be downloaded and uncompressed.
         | https://libguestfs.org/nbdkit-xz-filter.1.html
        
       | [deleted]
        
       ___________________________________________________________________
       (page generated 2021-12-13 23:00 UTC)