Subj : Re: Storing 20 million randomly accessible documents in compressed form
To   : comp.programming
From : Gerry Quinn
Date : Wed Sep 07 2005 01:29 pm

In article <3o5ji4F4ap5hU1@individual.net>, r124c4u102@comcast.net says...
> "Gerry Quinn" writes:
>
> > Compression ratios of 30 or so have been reported for English text,
> > using schemes based on Huffman encoding.
>
> If you have a link handy I would be interested in reading about 30:1 with
> Huffman based coding. That's very, very, very impressive. Did you mean to
> suggest that this applied to *generalized* text?

Search on a list of keywords like 'huffman encoding english text ratio'
(without the apostrophes) and you'll find various links.

I don't know much about the subject, and 30 does seem high, but when you
think about it there is huge redundancy in the English language, so I
believe it is indeed plausible for generalised text (and I am in no doubt
at all that, say, 10X compression is feasible).

I don't know the details, but I suspect commonly used words and phrases,
perhaps even whole sentences, will have relatively short codes. The bigger
the mass of text, the more efficiently it can be compressed, and the OP has
1.7GB to play with. (A rough sketch of the short-codes-for-common-words
idea follows below.)

- Gerry Quinn
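
[A minimal illustrative sketch, not taken from the thread: word-level
Huffman coding in Python (the language and the helper name huffman_codes
are my own choices), showing how the most frequent words end up with the
shortest bit strings. It only demonstrates the mechanism being speculated
about above; real systems would combine this with phrase dictionaries or
context modelling.]

    import heapq
    from collections import Counter
    from itertools import count

    def huffman_codes(words):
        """Return a dict mapping each word to its Huffman bit string."""
        freq = Counter(words)
        tiebreak = count()                  # keeps heap comparisons well-defined
        # Heap entries: (frequency, tiebreaker, {word: code-so-far})
        heap = [(f, next(tiebreak), {w: ""}) for w, f in freq.items()]
        heapq.heapify(heap)
        if len(heap) == 1:                  # degenerate case: one distinct word
            return {w: "0" for w in heap[0][2]}
        while len(heap) > 1:
            f1, _, c1 = heapq.heappop(heap) # two least frequent subtrees
            f2, _, c2 = heapq.heappop(heap)
            merged = {w: "0" + code for w, code in c1.items()}
            merged.update({w: "1" + code for w, code in c2.items()})
            heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
        return heap[0][2]

    sample = "the cat sat on the mat and the dog sat on the rug".split()
    codes = huffman_codes(sample)
    for word in sorted(codes, key=lambda w: len(codes[w])):
        print(word, codes[word])
    # 'the' (4 occurrences) gets the shortest code; rare words get longer ones.

[The larger the body of text, the better the word frequencies are estimated
and the more the common words dominate, which is why the technique pays off
on a collection the size of the OP's 1.7GB.]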