[HN Gopher] Show HN: 10-40% faster LZMA decoder using x86 CMOVcc
___________________________________________________________________
Show HN: 10-40% faster LZMA decoder using x86 CMOVcc
Author : jpegqs
Score : 149 points
Date : 2021-12-13 15:29 UTC (7 hours ago)
(HTM) web link (gist.github.com)
(TXT) w3m dump (gist.github.com)
| hasmanean wrote:
| When did CMOVs become efficient again? The x86 optimization guide
| used to warn people not to use them, around 2008.
| ncmncm wrote:
| Cmov is, and was, efficient whenever branch prediction is
| unreliable.
|
| Clang will turn eligible ?: expressions into cmov. Gcc will do
| no more than one of those per basic block, subject to fragile
| conditions. Gcc got its fingers badly burned by overuse of
| cmov. [Edit: I am wrong! Gcc would not do it when I was trying.
| More research needed.]
|
 | Generally, branches break dependency chains, enabling more
 | implicit speculative parallelism. Older cores stuck cmov ops
 | onto dependency chains, whereas more recently they are treated
 | more like branches. That gives you less speculative evaluation,
 | but consumes less state that would be discarded on a
 | mispredicted branch.
| 10000truths wrote:
| https://www.agner.org/optimize/optimizing_assembly.pdf#page=...
|
 | TL;DR: the rule of thumb is that a conditional jump is better
 | than CMOV if the code is part of a dependency chain and the
 | prediction rate is better than 75%.
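 | As a toy illustration of the trade-off (a sketch, not code from
 | this thread; actual codegen depends on compiler, flags, and
 | target):

```c
#include <stdint.h>
#include <assert.h>

/* Branchy form: typically lowered to cmp + jcc. Cheap when the
 * branch predictor guesses right, expensive on mispredicts. */
uint32_t min_branchy(uint32_t a, uint32_t b)
{
    if (a < b)
        return a;
    return b;
}

/* Ternary form: compilers often lower this to cmov, replacing the
 * control dependency with a fixed-latency data dependency. */
uint32_t min_ternary(uint32_t a, uint32_t b)
{
    return a < b ? a : b;
}
```

 | Both return the same result; the difference is only in how the
 | condition is resolved at the microarchitectural level.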
| wolf550e wrote:
| CMOVs were terrible on Pentium 4 but ok on PPro, Pentium M,
| Core, Core 2, Sandybridge, Skylake, etc.
|
| Also ok on Atom and on AMD chips.
| jpegqs wrote:
| LZMA compression (used in 7-zip, XZ, LZMA2) is known as one of
 | the best, but it has a noticeable drawback: it's slow. I tried
 | to improve the decompression speed by removing excessive
 | branching when decoding each bit of the compressed data.
|
 | Decompression speedup from this patch largely depends on the
 | compression ratio: the higher the ratio, the smaller the
 | speedup. Compressed text, such as source code, gives the least
 | speedup.
|
 | These are the results from my Skylake, compiled with GCC.
 | Please help me with testing on different x86 CPUs.
 |
 | x86 (32-bit) should work, but I haven't tested it yet.
 | Compiling with Clang should work as well.
| jack_pp wrote:
| I've recently watched this talk that may help you profile
| better : https://youtu.be/r-TLSBdHe1A
| viraptor wrote:
| When I've seen CMOVcc, my mind jumped to the callcc name and now
| I can't stop thinking in what absurd way could you create an
| actual CMOV-with-current-continuation. Conditional-MOV-value-or-
| EIP-otherwise seems too trivial.
| injinj wrote:
 | Decent speedup on my 3970x. Including this patch reduced the
 | number of instructions in lzma_decoder.s (using gcc 8.3.1) by
 | about 8% (4417 lines of asm vs 4824). With perf stat, an
 | astonishing branch-misses reduction from 409M to 104M.
|
 | Using the firefox example from the gist:
 |
 |     $ tar -cJf lib.tar.xz /usr/lib64/firefox
 |
 | The xz shipped with the system:
 |
 |     $ perf stat xz -c -d lib.tar.xz > /dev/null
 |
 |     Performance counter stats for 'xz -c -d lib.tar.xz':
 |
 |           4,650.32 msec task-clock:u              #    1.000 CPUs utilized
 |                  0      context-switches:u        #    0.000 K/sec
 |                  0      cpu-migrations:u          #    0.000 K/sec
 |                591      page-faults:u             #    0.127 K/sec
 |     19,849,912,300      cycles:u                  #    4.269 GHz                     (83.33%)
 |        425,290,878      stalled-cycles-frontend:u #    2.14% frontend cycles idle    (83.33%)
 |      1,831,640,390      stalled-cycles-backend:u  #    9.23% backend cycles idle     (83.34%)
 |     23,973,036,103      instructions:u            #    1.21  insn per cycle
 |                                                   #    0.08  stalled cycles per insn (83.33%)
 |      2,939,144,233      branches:u                #  632.031 M/sec                   (83.34%)
 |        409,371,860      branch-misses:u           #   13.93% of all branches         (83.33%)
 |
 |        4.650679926 seconds time elapsed
 |        4.611657000 seconds user
 |        0.011931000 seconds sys
 |
 | The patched xz:
 |
 |     $ git clone http://git.tukaani.org/xz.git
 |     $ cd xz/src
 |     $ patch -l -p1 < ../faster_lxma_decoder_x86.patch
 |     $ cd .. ; autogen.sh && configure && make
 |     $ LD_PRELOAD=./liblzma/.libs/liblzma.so
 |     $ perf stat ./xz/.libs/xz -c -d ../../lib.tar.xz > /dev/null
 |
 |     Performance counter stats for './xz/.libs/xz -c -d ../../lib.tar.xz':
 |
 |           3,578.54 msec task-clock:u              #    1.000 CPUs utilized
 |                  0      context-switches:u        #    0.000 K/sec
 |                  0      cpu-migrations:u          #    0.000 K/sec
 |                593      page-faults:u             #    0.166 K/sec
 |     15,186,685,715      cycles:u                  #    4.244 GHz                     (83.32%)
 |        108,663,507      stalled-cycles-frontend:u #    0.72% frontend cycles idle    (83.32%)
 |      8,753,057,119      stalled-cycles-backend:u  #   57.64% backend cycles idle     (83.34%)
 |     27,322,182,837      instructions:u            #    1.80  insn per cycle
 |                                                   #    0.32  stalled cycles per insn (83.35%)
 |      1,979,944,734      branches:u                #  553.282 M/sec                   (83.34%)
 |        104,752,154      branch-misses:u           #    5.29% of all branches         (83.34%)
 |
 |        3.578973194 seconds time elapsed
 |        3.549329000 seconds user
 |        0.011942000 seconds sys
| CyberShadow wrote:
| Would using compiler intrinsics instead of raw assembler have any
| benefits here, such as being applicable to more architectures?
| dragontamer wrote:
| Well, ideally, compilers should be compiling into CMOV vs
| Branching decisions much better.
|
| I'm shocked, but not that shocked
| (https://tenor.com/view/shocked-gif-5787388), that compilers
| today still are weaker than a dedicated raw-assembly programmer
| in these kinds of microarchitectural decisions. Even with what
| should be normal 64-bit code with compilers that have very good
| modeling of throughputs / latencies per instruction.
|
| ---------
|
| I'm now more curious as to which compilers can turn the raw
| C-code into a cmov and which compilers turn the code into the
| less efficient form (I assume branching??)
| CyberShadow wrote:
| Using CMOV vs. branching seems like a strategic decision
| which does make sense to fall under the programmer's decision
| making, as there is a trade-off of always calculating a value
| which may be unused vs. the cost of the conditional branch.
| danachow wrote:
| > as there is a trade-off of always calculating a value
| which may be unused vs. the cost of the conditional branch.
|
| I don't really understand the point you're trying to make.
| Figuring out if a value is unused is most definitely the
| purview of an optimizer. Also, the "calculating a value"
| isn't really the trade off being made between cmov and
| branching.
| CyberShadow wrote:
| > I don't really understand the point you're trying to
| make. Figuring out if a value is unused is most
| definitely the purview of an optimizer.
|
| If the conditional move doesn't happen, then the source
| (insofar as the move is concerned) is unused.
|
 | Consider this pseudocode:
 |
 |     int value = some_nontrivial_function_with_no_side_effects();
 |     if (condition)
 |         *target = value;
|
| Note that the function can be as simple as a memory read.
|
| The compiler could compile this in two ways:
|
| 1. Observing that the function's result is used only if
| condition is true, move the function call inside the if
| block.
|
| 2. Always call the function, as in the source code, but
| compile the if block to a conditional move.
|
| In such situations, it would make sense to allow
| programmers to indicate the desired strategy to the
| compiler.
|
| I suppose CPUs might elide calculating the value even
| with a conditional move if they can predict the condition
| is [likely to be] false; I don't know how true that is in
| practice.
|
| > Also, the "calculating a value" isn't really the trade
| off being made between cmov and branching.
|
| Depending on the situation and interpretation of terms, I
| also agree.
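 | The two lowerings can be spelled out explicitly in C (a sketch;
 | load_value is a hypothetical stand-in for the nontrivial
 | side-effect-free function, here just a memory read):

```c
#include <stdint.h>
#include <assert.h>

/* Hypothetical "nontrivial function with no side effects". */
static int32_t load_value(const int32_t *p) { return *p; }

/* Strategy 1: hoist the call behind the condition. The value is
 * only computed when needed, at the price of a conditional branch. */
static void store_branchy(int cond, int32_t *target, const int32_t *src)
{
    if (cond)
        *target = load_value(src);
}

/* Strategy 2: always compute the value, then select the stored
 * result branchlessly. This is what a cmov-style lowering does;
 * note the store to *target becomes unconditional. */
static void store_cmov(int cond, int32_t *target, const int32_t *src)
{
    int32_t value = load_value(src);   /* computed unconditionally */
    *target = cond ? value : *target;  /* eligible for cmov */
}
```

 | Both leave *target in the same state; they differ only in
 | whether the load and the store are conditional.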
| brigade wrote:
| gcc and clang see the use of ternaries as a hint that they
| should use cmov, but it's obviously not guaranteed.
| dragontamer wrote:
| I'm trying to grok the performance improvement here.
|
 | The diff points to:
 |
 |     + /* tmp = rc.code < rc_bound ? rc.range = rc_bound, ~0 : 0; */ \
 |     + __asm__ ( \
 |     +     "cmpl %3, %2\n\t" \
 |     +     "cmovbl %3, %1\n\t" \
 |     +     "sbbl %0, %0" \
 |     +     : "=&r"(tmp), "+&r"(rc.range) \
 |     +     : "r"(rc.code), "r"(rc_bound) \
 |     + ); \
|
| This is the only use of assembly in this entire diff. That
| "tmp = rc.code < rc_bound ?..." statement looks like it'd
| probably be a cmov, at least I'd expect it to compile to
| cmov without much issue.
|
 | However, there is some advanced carry-flag manipulation going
 | on here. I'm not sure it's the cmov per se that's advanced so
 | much as the sbbl instruction (subtract with borrow, the
 | subtraction analog of adc). sub %0, %0 is obviously zero, but
 | sbbl %0, %0 is 0xFFFFFFFF if the carry flag is 1.
 |
 | It's certainly a very well-thought-out bit of manual assembly.
 | The sbb is probably more important than the cmov (in that the
 | compiler probably emits the cmov, but may not see the sbb?)
|
| ----------
|
 | The original code seems to be:
 | https://github.com/COMBINE-lab/xz/blob/master/src/liblzma/ra...
 |     #define rc_direct(dest, seq) \
 |         do { \
 |             rc_normalize(seq); \
 |             rc.range >>= 1; \
 |             rc.code -= rc.range; \
 |             rc_bound = UINT32_C(0) - (rc.code >> 31); \
 |             rc.code += rc.range & rc_bound; \
 |             dest = (dest << 1) + (rc_bound + 1); \
 |         } while (0)
|
 | I don't know what it's doing, but the cmov version is entirely
 | different; there doesn't seem to be a branch involved at all in
 | this rc_direct inner loop.
 |
 | It's not clear to me how they looked at this sequence of C and
 | decided on a cmov/sbb approach to do the equivalent work. It's
 | clearly some kind of advanced thinking that no compiler would
 | have gotten.
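 | For what it's worth, the asm's effect can be mimicked in
 | portable C; a sketch of the equivalent (names are mine, not
 | from the patch):

```c
#include <stdint.h>
#include <assert.h>

/* Portable sketch of the patch's cmp/cmovb/sbb sequence:
 *     tmp = code < bound ? (range = bound, ~0u) : 0;
 * One comparison yields both the conditionally updated range
 * (the cmovbl) and an all-ones/all-zeros mask (the sbbl %0,%0,
 * which reuses the same carry flag), with no branch. */
static uint32_t range_mask(uint32_t code, uint32_t bound, uint32_t *range)
{
    uint32_t mask = 0u - (uint32_t)(code < bound); /* like sbbl %0, %0 */
    *range = (bound & mask) | (*range & ~mask);    /* like cmovbl */
    return mask;
}
```

 | A compiler may or may not lower this back to cmov + sbb, which
 | is presumably why the patch writes the asm by hand.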
| brigade wrote:
 | It's replacing rc_bit, not rc_direct (rc_direct is a fixed 0.5
 | probability), which just uses rc_bit_last, so the original code
 | is:
 |
 |     #define rc_bit_last(prob, action0, action1, seq) \
 |         do { \
 |             rc_if_0(prob, seq) { \
 |                 rc_update_0(prob); \
 |                 action0; \
 |             } else { \
 |                 rc_update_1(prob); \
 |                 action1; \
 |             } \
 |         } while (0)
|
| So it's merging the two sides of rc_update and cmov/sbb
| handle the difference. action0/action1 are generally
| blank, but rc_bit_matched makes the common action
| branchless as well.
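 | A rough sketch of that merged update in C, using liblzma's
 | constants RC_BIT_MODEL_TOTAL = 2048 and RC_MOVE_BITS = 5 (this
 | shows the idea, not the patch's actual code):

```c
#include <stdint.h>
#include <assert.h>

#define RC_BIT_MODEL_TOTAL 2048u  /* liblzma's probability scale */
#define RC_MOVE_BITS 5

/* Merged probability update: mask is ~0 when the decoded bit is 0
 * (code < bound) and 0 when it is 1, selecting between
 *     prob += (TOTAL - prob) >> MOVE_BITS    (bit 0, rc_update_0)
 *     prob -= prob >> MOVE_BITS              (bit 1, rc_update_1)
 * without a branch. */
static uint32_t update_prob(uint32_t prob, uint32_t mask)
{
    uint32_t up   = (RC_BIT_MODEL_TOTAL - prob) >> RC_MOVE_BITS;
    uint32_t down = prob >> RC_MOVE_BITS;
    return prob + (up & mask) - (down & ~mask);
}
```

 | The mask here is the ~0/0 value that the patch's sbbl produces.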
| dragontamer wrote:
| Thanks for the info.
|
 | Realizing that the two sides of the "if" statement effectively
 | differ by a single bit of information, storing that bit in the
 | carry flag, and then using cmovcc / sbbl instructions to
 | manipulate it, is pretty advanced.
|
| I could imagine a compiler making use of all that
| information, but only maybe with some programmer
| assistance (IE: code in the right format for easier
| optimizing). Or you know... an assembly programmer who
| just outright explicitly writes that kind of assembly
| language.
| brigade wrote:
| Yeah, this sort of conversion is pretty common for the
| core of arithmetic/range decoders, and I definitely don't
| expect a compiler to be capable of the sort of
| transformations needed to convert the branchy form to
| branchless.
|
 | Though I'd expect compilers to be at least 90% as good with the
 | ternary equivalent of the asm, e.g.:
 |
 |     tmp = code < bound ? ~0 : 0;
 |     range = code < bound ? bound : range;
 |
 | compiles tmp to xor -> setl -> neg (clang) or setge -> movzx ->
 | sub (gcc) instead of sbb, but on arm64 they use just one csetm
 | (clang) or csinv (gcc), which is basically optimal, and in all
 | cases the comparison is reused for the cmov.
|
| https://godbolt.org/z/jPbo3qc1Y
| ncmncm wrote:
| Correctly-predicted branching is better; but if the branch
| would have been mis-predicted, cmov is better. The compiler
| has no idea which will happen at runtime.
|
| Gcc will never under any circumstances generate two cmov
| instructions in a basic block, even when you definitely want
| that. Clang will. [Edit: I am wrong. Gcc can do more than one
| cmov in a basic block. Just not when I was trying it!]
|
 | In principle, profile-guided optimization could help, if the
 | instrumented run is representative of production behavior. In
 | my tests it did not affect the cmov/branch choice. Neither did
 | the __builtin_expect intrinsic. Either of those could change in
 | any release.
|
| The Zstd and Lz4 decoders make very effective use of cmov.
| They should be your model.
| mmozeiko wrote:
| Here are two cmov's with gcc:
| https://godbolt.org/z/j9bcv639c
| ncmncm wrote:
 | Thank you. It appears Gcc won't put two cmov _writes to memory_
 | in a block. Thus,
 |
 |     void swap_if(bool c, int& a, int& b) {
 |         int ta = a, tb = b;
 |         a = c ? tb : ta;
 |         b = c ? ta : tb;
 |     }
 |
 | is very slow, under Gcc, when c is poorly predicted, as is
 | typical when e.g. partitioning for quicksort. But how well it
 | will be predicted depends on input data.
|
| [0] https://godbolt.org/z/j5W9dMjYE
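 | One workaround when the compiler won't emit the cmovs is the
 | classic xor-mask conditional swap (a sketch in C rather than
 | the C++ above; whether it beats the branchy codegen still
 | depends on the input data):

```c
#include <stdint.h>
#include <assert.h>

/* Branchless conditional swap: d is (a ^ b) when c is nonzero and
 * 0 otherwise, so xor-ing d into both values either swaps them or
 * is a no-op. No cmov or branch needed. */
static void swap_if_c(int c, int32_t *a, int32_t *b)
{
    int32_t d = (*a ^ *b) & -(int32_t)(c != 0);
    *a ^= d;
    *b ^= d;
}
```

 | Like cmov, this always does the same work regardless of c, so
 | its latency does not depend on prediction.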
| mmozeiko wrote:
| To me it looks like something related to some other
| optimization pass (I don't know much about gcc passes).
| But not related to writes to memory. Here are two writes
| both using cmov (on different code):
| https://godbolt.org/z/n3dTrPo6e
|
| Edit: compiling your code without modifications, but with
| `-Os` also gives two cmov's:
| https://godbolt.org/z/r86azb7be
| tssva wrote:
 | I'm not a programmer by profession and barely one by hobby. I
 | know nothing about compiler internals, but looking at the
 | definition of a basic block from Wikipedia ("In compiler
 | construction, a basic block is a straight-line code sequence
 | with no branches in except to the entry and no branches out
 | except at the exit"), it seems that the ternary operators in
 | "return (x ? a : b) + (y ? c : d);", although part of the same
 | programming block, would be distinct basic blocks to the
 | compiler.
| dragontamer wrote:
| Basic blocks are how compilers "conceptualize" our code
| and start thinking about optimization.
|
 | Frankly, my recommendation to you is to ignore that stuff. You
 | need three or four years of school to reach that level, and you
 | would normally study easier graph problems (e.g. finite
 | automata) before dealing with code optimization.
|
| The school curriculum is typically Programming 101 ->
| Data Structures -> Algorithms -> Finite Automata (aka:
| Regex) -> Pushdown Automata (aka: context-free grammars)
| -> Turing Machines (aka: general purpose code) ->
| Compilers (which use Finite Automata, Pushdown Automata,
| and Turing Machines simultaneously).
|
 | Basic blocks are the graph structure that compilers use to
 | think about code. It's a lot of graph theory built on top of a
 | lot of language theory.
|
| ------
|
| That being said: if you're actually interested in this
| stuff, then feel free to explore. But just beware, you've
| stumbled upon a very complex subject that is easily 4th
| year undergrad or even graduate-school level.
|
| With enough effort, you'll understand things. But this is
| no subject for beginners to go around exploring by
| themselves.
|
| -------
|
| That being said, I want to encourage you to explore the
| subject anyway. Just explore the subject with awareness
| that there's a lot to understand here.
|
 | If you want to skip the ~3 years of elementary data-structures
 | / comp-sci theory, I suggest starting with Static Single
 | Assignment form and working forward from there.
 |
 | https://en.wikipedia.org/wiki/Static_single_assignment_form
| jeffbee wrote:
| If you're going for decoding speed wouldn't you be using brotli
| anyway? At a given compression ratio brotli is ~4x faster on
| decode path, and they are similar on the encode path.
|
| Your corpus may vary etc etc.
| orev wrote:
| Contrary to popular belief (sadly, even among the tech crowd
| here), people use computers for more than just web stuff.
| skyde wrote:
| Brotli is not just for Web stuff.
| duskwuff wrote:
| Brotli is heavily tuned for "web stuff". It's literally got
| a built-in dictionary full of commonly used HTML fragments:
|
| https://gist.github.com/klauspost/2900d5ba6f9b65d69c8e
| sebazzz wrote:
| Shoot me, works well for the average desktop (=Electron)
| application then.
| adgjlsfhk1 wrote:
| Electron is just web stuff.
| orev wrote:
| Sure. In theory. But every other response to this comment
| are clearly talking about web stuff, and in practice brotli
| is really only used for web stuff.
| jeffbee wrote:
| I don't know what's theoretical about it. There are
| Debian packages for brotli and it just behaves like a
| stdio codec, same as the `xz` command.
| masklinn wrote:
| Brotli has a static dictionary heavily geared towards web
| formats, and there really is no reason to use brotli
| outside of web stuff: the point of brotli is that it has
| browser support and does 15~20% better than gzip.
|
 | It's not bad per se, but if you're not constrained to the web,
 | zstd will have a better compression ratio at every throughput
 | (and better throughput at every compression ratio), and if you
 | want maximum compression, period, it's not close to competing
 | with LZMA-based formats (xz, lzip).
| skyde wrote:
 | I think he means it's less popular (like Zstandard and LZ4)
 | compared to zlib, LZMA, and bzip2. This is unfortunate, because
 | Brotli and Zstandard compress much better than zlib and LZ4,
 | and decompress much faster than zlib, LZMA, and LZ4.
| 1vuio0pswjnm7 wrote:
| IIRC, in a recent thread about zstd, someone claimed brotli
| was faster. Perhaps one needs to look at whether the sample
| being compressed is web stuff.
| jeffbee wrote:
| If you're too busy to check yourself, the "squash
| compression benchmark" page has results for dozens of
| codecs and different inputs.
| pletnes wrote:
| Linux distros often ship xz out of the box. I've never seen
| brotli in the wild. YMMV of course - I am sure there are
| exceptions.
|
| I've used brotli myself, but only as it's included in some
| parquet read/write libs.
| peter-m80 wrote:
 | Cloudflare uses Brotli (not sure if optional or by default).
| stillicidious wrote:
 | Probably 10-40% of what your browser is served nowadays is
 | brotli; it's supported in Chrome, so all the major CDNs enable
 | it by default.
| pletnes wrote:
| I'm sure you're right, but files-on-disk is another story.
| I was referring to the latter.
| ByTheBook wrote:
 | JPEG XL makes use of Brotli, and once it hits 1.0 it will most
 | likely be enabled by default in browsers, so you will probably
 | see a lot of jxl in the wild, particularly since it provides
 | lossless recompression of existing JPEG files with ~20% smaller
 | size as a result.
| throwaway1777 wrote:
| It's definitely in use by Meta and Google.
| GordonS wrote:
| I'm not the OP, but I took their comment to mean "on the
| command line".
| dragontamer wrote:
| But this seems like a patch given to the LZMA project, with an
| interesting low-level result to boot.
| [deleted]
| rwmj wrote:
| I don't know about web use, but xz's major advantage for me is
| the seekable format. We build disk image pipelines based on
| remote xz-compressed images where only the parts of the file
| being accessed need to be downloaded and uncompressed.
| https://libguestfs.org/nbdkit-xz-filter.1.html
| [deleted]
___________________________________________________________________
(page generated 2021-12-13 23:00 UTC)