http://richg42.blogspot.com/2022/01/lzxor.html
Richard Geldreich's Blog
Thursday, January 6, 2022
LZ_XOR
If LZ compression is compilation, what other instructions are useful
to add? I've spent the last several years, off and on, trying to
figure this out.
LZSS has LIT [byte] and COPY [length, distance]. LZ_XOR is like LZSS
but with one new instruction the decompressor can execute. This
variant adds XOR [length, distance, 1 or more bytes].
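To make the instruction set concrete, here's a minimal sketch of what a
decompressor's inner loop for these three instructions could look like. The
in-memory instruction representation below is purely illustrative - a real
codec entropy codes the control stream and the literal/patch bytes:

    #include <cstdint>
    #include <vector>

    // Hypothetical decoded instruction; a real codec entropy codes all of this.
    enum class Op : uint8_t { LIT, COPY, XOR };

    struct Inst {
        Op op;
        uint32_t len;                  // bytes produced (COPY/XOR)
        uint32_t dist;                 // distance back into the output (COPY/XOR)
        std::vector<uint8_t> bytes;    // the literal byte (LIT) or the patch bytes (XOR)
    };

    // Execute the instruction stream, appending decoded bytes to 'out'.
    // 'out' doubles as the sliding dictionary: distances index back into it.
    void execute(const std::vector<Inst>& insts, std::vector<uint8_t>& out) {
        for (const Inst& in : insts) {
            switch (in.op) {
            case Op::LIT:    // emit one literal byte
                out.push_back(in.bytes[0]);
                break;
            case Op::COPY:   // plain LZ match: referenced bytes must equal the input bytes
                for (uint32_t i = 0; i < in.len; ++i)
                    out.push_back(out[out.size() - in.dist]);
                break;
            case Op::XOR:    // partial match: dictionary byte XOR patch byte
                for (uint32_t i = 0; i < in.len; ++i)
                    out.push_back(out[out.size() - in.dist] ^ in.bytes[i]);
                break;
            }
        }
    }

Note that COPY is just an XOR whose patch bytes are all zero.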
XOR (or ADD) gives the system partial matches, which is particularly
useful on GPU texture data, log files, object/executable files,
bitmaps, and audio files. In LZ, the COPY instruction must refer to
dictionary positions that perfectly match the input file bytes
(otherwise, the decompressor will copy the wrong bytes!). LZ_XOR can
use any match distance/length pair, even referring to bytes in the
dictionary which don't perfectly match (or match at all!) the bytes
to compress, although not every distance/length pair will result in
an efficient partial match. The compiler's goal is to find the XORs
that result in partial matches which are efficient to code.
A LZ_XOR compiler ("compressor" in LZ-parlance) can optimize for
minimum total Hamming distance between the lookahead buffer and the
previously encoded/emitted bytes in the sliding dictionary. Or, the
compressor can first optimize for minimal Hamming distance in one
pass, and then minimal bit prices in a second optimal parsing
(optimizing compiler) pass.
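As a sketch of that first Hamming-distance pass, here's one way a parser could
score a candidate (distance, length) pair. The single contiguous buffer holding
both the already-encoded dictionary and the lookahead is an assumption made for
illustration:

    #include <cstddef>
    #include <cstdint>

    // 'buf' holds the file being compressed: bytes [0, pos) are already encoded
    // (the sliding dictionary) and bytes [pos, pos + len) are the lookahead.
    // Unlike plain LZ, any (dist, len) pair is legal for an XOR instruction;
    // this just counts how many of its patch bytes would be nonzero.
    static size_t hamming_cost(const uint8_t* buf, size_t pos, size_t len, size_t dist) {
        size_t mismatches = 0;
        for (size_t i = 0; i < len; ++i)
            mismatches += (buf[pos + i] != buf[pos + i - dist]);  // also correct when dist < len (overlap)
        return mismatches;
    }

The second pass would replace this count with the actual bit price of coding the
resulting patch bytes.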
LZ_XOR is surprisingly strong and flexible:
1. You can constrain the XOR instruction in the compiler (parser) to
only use a limited # of unique symbols.
This property is valuable when implementing shuffle-based Huffman
decoding in AVX2/AVX-512, which is limited to only 16 or 32 unique
symbols.
2. XOR is a superset of plain LZ: With 8-bit symbols you don't need
LIT or COPY instructions at all ("pure" LZ_XOR, or XOR-only). This
simplifies the decompressor (eliminating unpredictable jumps), and
simplifies the instruction stream so less signaling needs to be
encoded. Everything is an XOR so there's no need to tell the
decompressor that the instruction is a LIT or COPY.
3. Most of the usual tricks used to increase LZSS's ratio by other
codecs (more contexts, LZMA-like state machines+contexts, REP
matches, circular match history buffers, fast entropy coding, larger
dictionaries, etc.) are compatible with LZ_XOR too. I've implemented
and tested several LZ_XOR variants with GPU parsing that are LZ4-like
and Zstd-like, along with ideas from LZMA/LZHAM.
4. A file compressed with LZ_XOR takes significantly fewer
instructions to decompress vs. LZSS (roughly 30-60% fewer). This
results in a decompressor which spends more time
copying or XOR'ing bytes, and less time unpacking and interpreting
the compressed instruction stream.
5. Another strong variant on some files (audio, bitmaps) is LZ_ADD.
Instead of optimizing for minimum total Hamming distance, you
optimize for minimum total absolute error; see the cost sketch after this list.
6. LZ_XOR is strong on uncompressible files (with very few matches),
such as GPU texture data. These are the files that LZ4 fails to
compress well.
7. LZ_XOR is like LZMA's or LZHAM's LAMs ("Literals after Matches")
taken to the next level.
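To make item 5's cost concrete, here's the absolute-error score an LZ_ADD parser
could minimize in place of the Hamming count sketched earlier (the decoder would
reconstruct each byte as the dictionary byte plus a stored delta, mod 256). This
is just a sketch of the idea, not the exact cost model used in these experiments:

    #include <cstddef>
    #include <cstdint>

    // LZ_ADD cost of a candidate (dist, len) pair: total absolute byte error
    // between the lookahead and the dictionary bytes 'dist' back. Compare with
    // hamming_cost() above, which only counts how many bytes differ.
    static size_t add_cost(const uint8_t* buf, size_t pos, size_t len, size_t dist) {
        size_t total = 0;
        for (size_t i = 0; i < len; ++i) {
            int d = int(buf[pos + i]) - int(buf[pos + i - dist]);
            total += size_t(d < 0 ? -d : d);
        }
        return total;
    }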
Example disassembly of a simple LZ_XOR system with only LIT and XOR
instructions, constrained to 32 unique XOR bytes (no copies - they are
implemented as XORs with all-0 XOR bytes):
[image]
Another example that uses XOR, COPY, and LITs, along with REP0/1
distances (from LZMA), and a significantly more complex control/
instruction stream:
[image]
Here's "Alice in Wonderland" compressed with plain LZSS on the left,
and LZ_ADD on the right. Notice how much faster it makes progress
through the file vs. LZSS:
[image]
Compressing the 24 character ASCII string "executable files go here".
First run uses LZ-style LIT+COPY plus a new instruction, ADD. Second
run just uses LIT+COPY:
[image]
XOR-only decompression can be done in two simple steps, using a
single output buffer: first entropy decode the XOR bytes into a
buffer the size of the output block. This can be done at ~1 GB/sec
using a fast order-0 Huffman or FSE decoder, such as this one. The
bulk of the XOR byte values will be 0. The frequency histogram will
have one enormous spike with a quick falloff.
Next, execute the XOR control stream and XOR the bytes in the sliding
dictionary with the "patch" bytes in the lookahead buffer. (The
"patch" bytes are in the brackets in the above disassemblies. In
XOR-only there's guaranteed to be a single XOR byte for every byte in
the block.) This step is roughly 1-1.5 GiB/sec. It's in-place so it's
very cache friendly.
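Here's a rough sketch of that second step, assuming the control stream has
already been unpacked into a plain array of (length, distance) pairs and the
entropy decoded patch bytes have been written into the output buffer at the
block's position:

    #include <cstdint>
    #include <vector>

    struct XorInst { uint32_t len; uint32_t dist; };  // XOR-only: one XOR per run of bytes

    // 'out' already holds the previously decoded data (the sliding dictionary),
    // followed by the block's entropy decoded patch bytes starting at block_start.
    // Each patch byte is XORed in place with the dictionary byte 'dist' back.
    void apply_xor_stream(std::vector<uint8_t>& out, size_t block_start,
                          const std::vector<XorInst>& insts) {
        size_t pos = block_start;
        for (const XorInst& in : insts) {
            for (uint32_t i = 0; i < in.len; ++i, ++pos)
                out[pos] ^= out[pos - in.dist];  // patch byte ^ dictionary byte
            // references that land inside the current block are fine: those
            // bytes were finalized earlier in this same pass
        }
    }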
LZ_XOR places very heavy demands on the compressor's parser to find
the partial matches with minimal Hamming distance (or emitted bits),
making it a good fit for GPU parsing. My experimental compiler
currently evaluates every single possible way to code the lookahead
bytes against all the bytes in the sliding dictionary. It uses
Dijkstra's shortest path algorithm, where each edge is an
instruction, the costs are bit prices, and the nodes are lookahead
byte positions. There are numerous heuristics and search
optimizations that can be done to speed up partial matching. This is
the Approximate String Matching Problem.
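A sketch of that shortest-path formulation: nodes are lookahead byte positions,
edges are candidate instructions priced in bits. Because every edge moves forward
with a non-negative cost, a simple forward pass over positions visits nodes in the
same order Dijkstra would. The candidate generator and the bit prices below are
crude placeholders, not the real compiler's:

    #include <cstddef>
    #include <cstdint>
    #include <limits>
    #include <vector>

    struct Edge {      // one candidate instruction starting at a given position
        uint32_t len;  // bytes it advances
        float bits;    // estimated price of coding it
        // a real parser also stores the kind (LIT/COPY/XOR), distance and patch bytes
    };

    // Placeholder candidate generator: a 1-byte LIT, plus length-4 XORs at a few
    // recent distances. A real compressor searches far more of the dictionary --
    // that's the expensive step which maps well onto the GPU.
    static std::vector<Edge> candidates_at(const uint8_t* buf, size_t pos, size_t n) {
        std::vector<Edge> c;
        c.push_back({1, 9.0f});  // literal: tag + entropy coded byte (rough guess)
        for (uint32_t dist : {1, 2, 4, 8}) {
            const size_t len = 4;
            if (dist > pos || pos + len > n) continue;
            size_t mismatches = 0;
            for (size_t i = 0; i < len; ++i)
                mismatches += (buf[pos + i] != buf[pos + i - dist]);
            c.push_back({(uint32_t)len, 16.0f + 8.0f * mismatches});  // rough: header + nonzero patch bytes
        }
        return c;
    }

    // cost[p] = cheapest known way to code buf[0..p). Keep a parent[] array
    // alongside cost[] to recover the winning instruction sequence.
    float optimal_parse(const uint8_t* buf, size_t n) {
        const float INF = std::numeric_limits<float>::infinity();
        std::vector<float> cost(n + 1, INF);
        cost[0] = 0.0f;
        for (size_t pos = 0; pos < n; ++pos) {
            if (cost[pos] == INF) continue;
            for (const Edge& e : candidates_at(buf, pos, n)) {
                size_t next = pos + e.len;
                if (next <= n && cost[pos] + e.bits < cost[next])
                    cost[next] = cost[pos] + e.bits;
            }
        }
        return cost[n];
    }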
LZ_XOR gives an RDO GPU texture compressor a lot more freedom to
distort a texture block's encoded bytes to increase ratio. With plain
LZ/LZSS-style systems, your primary tool to increase the compression
ratio (and trade off texture quality) is to increase the number and
length of the LZ matches in the encoded texture data. With LZ_XOR you
can replace bytes with other bytes which minimize the Hamming
distance between the block you're encoding and the already coded
blocks in the sliding dictionary. This is a more flexible way of
increasing distortion without slamming in large 3-4 byte matches.
While building the above experimental codecs, I also used the GPU
to compute Hamming Correlation Matrix visualizations. This is for
the first 494 bytes of alice29.txt (Alice in Wonderland). Each row
represents how much the 18 byte pattern starting at that file offset
correlates with all the previous byte patterns. White=low Hamming
distance, black=high distance.
[image]
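For reference, here's a simple CPU sketch of that computation (the real version
ran on the GPU). The 18-byte window matches the description above; everything
else is illustrative:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // matrix[r][c] = Hamming distance in bytes between the 'window'-byte pattern
    // starting at offset r and the one at offset c, for every c < r.
    // Low distance = strong correlation (drawn as white in the visualization).
    std::vector<std::vector<uint8_t>> hamming_matrix(const uint8_t* buf, size_t n,
                                                     size_t window = 18) {
        size_t rows = (n >= window) ? (n - window + 1) : 0;
        std::vector<std::vector<uint8_t>> m(rows);
        for (size_t r = 0; r < rows; ++r) {
            m[r].resize(r);
            for (size_t c = 0; c < r; ++c) {
                uint8_t d = 0;
                for (size_t i = 0; i < window; ++i)
                    d += (buf[r + i] != buf[c + i]);
                m[r][c] = d;  // 0..window
            }
        }
        return m;
    }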
Posted by Rich Geldreich at 12:44 AM