Subj : Re: Storing 20 million randomly accessible documents in compressed form
To   : comp.programming
From : Gerry Quinn
Date : Wed Sep 07 2005 01:29 pm

In article <3o5ji4F4ap5hU1@individual.net>, r124c4u102@comcast.net says...
> "Gerry Quinn" writes:
>
> > Compression ratios of 30 or so have been reported for English text,
> > using schemes based on Huffman encoding.
>
> If you have a link handy I would be interested in reading about 30:1 with
> Huffman based coding. That's very, very, very impressive. Did you mean to
> suggest that this applied to *generalized* text?

Search on a list of keywords like 'huffman encoding english text ratio'
(without the apostrophes) and you'll find various links.

I don't know much about the subject, and 30 does seem high, but when you
think about it there is huge redundancy in the English language, so I
believe it is indeed plausible for generalised text (and I am in no doubt
at all that, say, 10X compression is feasible).

I don't know the details, but I suspect commonly used words and phrases,
perhaps even whole sentences, will have relatively short codes. The bigger
the mass of text, the more efficiently it can be compressed, and the OP has
1.7GB to play with. (A rough sketch of the short-codes-for-common-words
idea follows below.)

- Gerry Quinn
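
[A minimal illustrative sketch, not taken from the thread: word-level
Huffman coding in Python (the language and the helper name huffman_codes
are my own choices), showing how the most frequent words end up with the
shortest bit strings. It only demonstrates the mechanism being speculated
about above; real systems would combine this with phrase dictionaries or
context modelling.]

    import heapq
    from collections import Counter
    from itertools import count

    def huffman_codes(words):
        """Return a dict mapping each word to its Huffman bit string."""
        freq = Counter(words)
        tiebreak = count()                  # keeps heap comparisons well-defined
        # Heap entries: (frequency, tiebreaker, {word: code-so-far})
        heap = [(f, next(tiebreak), {w: ""}) for w, f in freq.items()]
        heapq.heapify(heap)
        if len(heap) == 1:                  # degenerate case: one distinct word
            return {w: "0" for w in heap[0][2]}
        while len(heap) > 1:
            f1, _, c1 = heapq.heappop(heap) # two least frequent subtrees
            f2, _, c2 = heapq.heappop(heap)
            merged = {w: "0" + code for w, code in c1.items()}
            merged.update({w: "1" + code for w, code in c2.items()})
            heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
        return heap[0][2]

    sample = "the cat sat on the mat and the dog sat on the rug".split()
    codes = huffman_codes(sample)
    for word in sorted(codes, key=lambda w: len(codes[w])):
        print(word, codes[word])
    # 'the' (4 occurrences) gets the shortest code; rare words get longer ones.

[The larger the body of text, the better the word frequencies are estimated
and the more the common words dominate, which is why the technique pays off
on a collection the size of the OP's 1.7GB.]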