Subj : Re: Storing 20 million randomly accessible documents in compressed form
To   : comp.programming
From : mwojcik
Date : Fri Sep 09 2005 04:51 pm

In article , Gerry Quinn writes:

> In article <3o5ji4F4ap5hU1@individual.net>, r124c4u102@comcast.net
> says...
> > "Gerry Quinn" writes:
> >
> > > Compression ratios of 30 or so have been reported for English text,
> > > using schemes based on Huffman encoding.
> >
> > If you have a link handy I would be interested in reading about 30:1
> > with Huffman based coding.
>
> Search for lists of keywords like 'huffman encoding english text
> ratio' (without apostrophes) and you'll find various links.

Hmm.  I did that search and looked through the first couple of pages of
hits, but didn't find anything even of that order of magnitude.  Plenty
of sources quoted "3:1" or "30%", which is reasonable.

That's not to say there are no such results, of course - just that I
didn't find them.

> I don't know anything much about the subject, and the 30 does seem
> high, but when you think about it there is huge redundancy in the
> english language, so I believe it is indeed plausible for generalised
> text (and I am in no doubt at all that (say) 10X compression is
> feasible).

With pure Huffman encoding?  Maybe if the symbol set is words and
punctuators rather than characters, across a sufficiently large sample,
but I'm still a little dubious.  (With a hybrid scheme of some sort
rather than just Huffman, 10:1 seems quite plausible.)

Of course, "general English text" is a fairly vague term.  It's possible
that really high ratios were achieved using somewhat lossy
transformations - loss of formatting, regularization or even removal of
punctuation, case folding, etc. - or that the sample didn't include
things that some people might consider part of general text, such as
decimal numbers.

> I don't know the details, but I suspect commonly used words and
> phrases, perhaps even whole sentences, will have relatively short
> codes.

Using a symbol set other than (only) individual characters can certainly
improve compression.  Picking the optimal symbol set is a hard problem,
though (NP-complete, I'd guess offhand).  That's why Huffman coding is
usually implemented with a very simple set of fixed-length symbols
(typically bytes or characters), and better compression is achieved
using predictive encoders from the PPM family (to which BWT-based
compressors are closely related).

--
Michael Wojcik                             michael.wojcik@microfocus.com

Pogo: The dogs *scarcely* ever catches the rabbit.
Bun:  "Scarcely" got a very unpleasant ring of frequency to it.
   -- Walt Kelly
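
As a rough illustration of the character-level case discussed above,
here is a minimal sketch of order-0 (per-character) Huffman coding in
Python.  It is not from the thread; the sample string and function names
are made up for illustration, and the cost of storing the code table is
ignored.

import heapq
from collections import Counter

def huffman_code_lengths(text):
    # Build Huffman code lengths per character.  The lengths alone are
    # enough to measure compressed size; actual bit patterns aren't needed.
    freq = Counter(text)
    if len(freq) == 1:
        return {next(iter(freq)): 1}
    heap = [(f, i, {ch: 0}) for i, (ch, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees adds one bit to every code beneath them.
        merged = {ch: depth + 1 for ch, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

def ratio(text):
    # Assume 8 bits per input character; ignore the code-table overhead.
    freq = Counter(text)
    lengths = huffman_code_lengths(text)
    return 8.0 * len(text) / sum(freq[ch] * lengths[ch] for ch in freq)

sample = ("This is a short sample of ordinary English prose.  It is far "
          "too small to be a real corpus, so the number it prints is only "
          "a rough indication of what character-level Huffman coding "
          "achieves on general text.")
print("character-level Huffman ratio: %.2f:1" % ratio(sample))

On realistic English samples this sort of order-0 coder tends to land
somewhere around 1.7:1 to 2:1.  The "3:1" / "30%" figures mentioned
above generally involve a larger symbol alphabet or extra modelling in
front of the Huffman back end, which is consistent with 30:1 being out
of reach for the plain scheme.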