Subj : Re: Storing 20 million randomly accessible documents in compressed form
To   : comp.programming
From : mwojcik
Date : Tue Sep 06 2005 03:26 pm

In article <431d09bb$0$97137$ed2619ec@ptn-nntp-reader03.plus.net>, Jon Harrop writes:

> M.Barren@gmail.com wrote:
> > I will give the BIG zip file a try to see if it compares anywhere near
> > 200 MB (because I will have another 30 to 40 Million documents on the
> > way). I might be expecting magic from compression technologies but I
> > just want to make sure I will save as much space as I can since there's
> > a little bit of competition to it as well.
>
> If you're compressing small files then zip/gzip is fine. For larger files
> (larger than bzip's block size), bzip is likely to be substantially better.

For data of this sort - very high redundancy due to the XML representation
and the related nature of the files - star encoding (or more precisely
SCLPT, an advanced form of star encoding, followed by an appropriate
entropy encoder) generally outperforms even BWT-based encoders (e.g.
bzip). (There's also the LPT star-encoding scheme, which is intended to
be used as a preprocessor for bzip2. LPT followed by bzip2 isn't quite as
effective as SCLPT, but it is useful if you want to squeeze a bit more
compression out of bzip2.) See [1]; results are available in the paper
cited in the References.

Basic star encoding is a simple dictionary compression technique, and if
zip alone did not solve the OP's problem, star encoding followed by zip
might well do so. There's sample C++ code in the article I cited.

All of these algorithms are related - they're all finite-context
predictive models of some sort, and there are various papers demonstrating
that BWT, for example, is a member of the PPM family. There are a variety
of open-source PPM compressors available - see [2].

1. http://www.dogma.net/markn/articles/Star/
2. http://datacompression.info/PPM.shtml

--
Michael Wojcik             michael.wojcik@microfocus.com

It wasn't fair; my life was now like everyone else's.  -- Eric Severance
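P.S. For anyone who wants to see the basic idea in code, here is a minimal,
self-contained sketch of a star-style transform. It is my own toy
illustration, not the SCLPT implementation from [1], and the tiny
dictionary is invented for the example. Words found in a shared dictionary
are rewritten as same-length tokens made mostly of '*', so a conventional
compressor (zip, gzip, bzip2) can then squeeze the output much harder than
the raw XML.

// Toy star transform: replace dictionary words with same-length tokens
// made mostly of '*', then hand the output to zip/gzip/bzip2.  This is
// an illustrative sketch only; the dictionary below is made up.

#include <cstddef>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Code for the rank-th dictionary word of a given length: the first word
// becomes all stars, later words keep stars except one trailing letter.
static std::string star_code(std::size_t rank, std::size_t len)
{
    std::string code(len, '*');
    if (rank > 0 && len > 0)
        code[len - 1] = static_cast<char>('a' + (rank - 1) % 26);
    return code;
}

int main()
{
    // Dictionary, ordered by assumed descending frequency.
    std::vector<std::string> dict;
    dict.push_back("document");
    dict.push_back("the");
    dict.push_back("and");
    dict.push_back("xml");

    std::map<std::string, std::string> codes;
    std::map<std::size_t, std::size_t> rank_by_len;
    for (std::size_t i = 0; i < dict.size(); ++i)
        codes[dict[i]] = star_code(rank_by_len[dict[i].size()]++,
                                   dict[i].size());

    // Filter stdin to stdout, word by word; unknown words pass through.
    std::string line;
    while (std::getline(std::cin, line)) {
        std::istringstream words(line);
        std::string w, sep;
        while (words >> w) {
            std::map<std::string, std::string>::const_iterator it =
                codes.find(w);
            std::cout << sep << (it != codes.end() ? it->second : w);
            sep = " ";
        }
        std::cout << "\n";
    }
    return 0;
}

The decoder needs the same dictionary in the same order to reverse the
mapping, and a real implementation would also escape any literal '*' in
the input; both are omitted to keep the sketch short. Piping the output
through zip or bzip2 is where the actual space saving happens.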