Subj : Re: Storing 20 million randomly accessible documents in compressed form
To   : comp.programming
From : mwojcik
Date : Tue Sep 06 2005 03:26 pm

In article <431d09bb$0$97137$ed2619ec@ptn-nntp-reader03.plus.net>, Jon Harrop writes:

> M.Barren@gmail.com wrote:
> > I will give the BIG zip file a try to see if it compares anywhere near
> > 200 MB (because I will have another 30 to 40 Million documents on the
> > way). I might be expecting magic from compression technologies but I
> > just want to make sure I will save as much space as I can since there's
> > a little bit of competition to it as well.
>
> If you're compressing small files then zip/gzip is fine. For larger files
> (larger than bzip's block size), bzip is likely to be substantially better.

For data of this sort - very high redundancy due to the XML representation
and the related nature of the files - star encoding (or more precisely
SCLPT, an advanced form of star encoding, followed by an appropriate
entropy encoder) generally outperforms even BWT-based encoders (e.g.
bzip). (There's also the LPT star-encoding scheme, which is intended to
be used as a preprocessor for bzip2. LPT followed by bzip2 isn't quite as
effective as SCLPT, but it is useful if you want to squeeze a bit more
compression out of bzip2.) See [1]; results are available in the paper
cited in the References.

Basic star encoding is a simple dictionary compression technique, and if
zip alone did not solve the OP's problem, star encoding followed by zip
might well do so. There's sample C++ code in the article I cited.

All of these algorithms are related - they're all finite-context
predictive models of some sort, and there are various papers demonstrating
that BWT, for example, is a member of the PPM family. There are a variety
of open-source PPM compressors available - see [2].

1. http://www.dogma.net/markn/articles/Star/
2. http://datacompression.info/PPM.shtml

--
Michael Wojcik             michael.wojcik@microfocus.com

It wasn't fair; my life was now like everyone else's.  -- Eric Severance
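P.S. For anyone who wants to see the basic idea in code, here is a minimal,
self-contained sketch of a star-style transform. It is my own toy
illustration, not the SCLPT implementation from [1], and the tiny
dictionary is invented for the example. Words found in a shared dictionary
are rewritten as same-length tokens made mostly of '*', so a conventional
compressor (zip, gzip, bzip2) can then squeeze the output much harder than
the raw XML.

// Toy star transform: replace dictionary words with same-length tokens
// made mostly of '*', then hand the output to zip/gzip/bzip2.  This is
// an illustrative sketch only; the dictionary below is made up.

#include <cstddef>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Code for the rank-th dictionary word of a given length: the first word
// becomes all stars, later words keep stars except one trailing letter.
static std::string star_code(std::size_t rank, std::size_t len)
{
    std::string code(len, '*');
    if (rank > 0 && len > 0)
        code[len - 1] = static_cast<char>('a' + (rank - 1) % 26);
    return code;
}

int main()
{
    // Dictionary, ordered by assumed descending frequency.
    std::vector<std::string> dict;
    dict.push_back("document");
    dict.push_back("the");
    dict.push_back("and");
    dict.push_back("xml");

    std::map<std::string, std::string> codes;
    std::map<std::size_t, std::size_t> rank_by_len;
    for (std::size_t i = 0; i < dict.size(); ++i)
        codes[dict[i]] = star_code(rank_by_len[dict[i].size()]++,
                                   dict[i].size());

    // Filter stdin to stdout, word by word; unknown words pass through.
    std::string line;
    while (std::getline(std::cin, line)) {
        std::istringstream words(line);
        std::string w, sep;
        while (words >> w) {
            std::map<std::string, std::string>::const_iterator it =
                codes.find(w);
            std::cout << sep << (it != codes.end() ? it->second : w);
            sep = " ";
        }
        std::cout << "\n";
    }
    return 0;
}

The decoder needs the same dictionary in the same order to reverse the
mapping, and a real implementation would also escape any literal '*' in
the input; both are omitted to keep the sketch short. Piping the output
through zip or bzip2 is where the actual space saving happens.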