Subj : Re: Storing 20 million randomly accessible documents in compressed form
To   : comp.programming
From : Gerry Quinn
Date : Sun Sep 11 2005 12:18 pm

In article , mwojcik@newsguy.com says...
>
> In article , Gerry Quinn writes:
> > In article <3o5ji4F4ap5hU1@individual.net>, r124c4u102@comcast.net
> > says...
> > I don't know anything much about the subject, and the 30 does seem
> > high, but when you think about it there is huge redundancy in the
> > English language, so I believe it is indeed plausible for generalised
> > text (and I am in no doubt at all that (say) 10X compression is
> > feasible).
>
> With pure Huffman encoding? Maybe if the symbol set is words and
> punctuators rather than characters, across a sufficiently large
> sample, but I'm still a little dubious. (With a hybrid scheme of
> some sort rather than just Huffman, 1:10 seems quite plausible.)

That was my feeling too. I was thinking of a hybrid system, but the core
would be Huffman encoding on words and phrases. The best compression would
be approached in the limit as the dictionary gets very large.

But as I said, I couldn't find those 30:1 links again - maybe I had some
kind of brainstorm...

- Gerry Quinn
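
P.S. Just to sketch what I mean by the word-level core (illustrative
Python only - huffman_code_lengths and word_huffman_ratio are names I've
made up for the example, and it ignores the cost of storing the dictionary
itself, punctuation, and the spaces between words, so the ratio it reports
is optimistic):

import heapq

def huffman_code_lengths(freqs):
    # freqs maps symbol -> count; returns symbol -> code length in bits,
    # computed by the usual Huffman merge of the two least frequent groups.
    heap = [(count, i, [sym]) for i, (sym, count) in enumerate(freqs.items())]
    heapq.heapify(heap)
    lengths = dict((sym, 0) for sym in freqs)
    tie = len(heap)
    while len(heap) > 1:
        ca, _, group_a = heapq.heappop(heap)
        cb, _, group_b = heapq.heappop(heap)
        for sym in group_a + group_b:
            lengths[sym] += 1      # each merge pushes the group one level deeper
        heapq.heappush(heap, (ca + cb, tie, group_a + group_b))
        tie += 1
    return lengths

def word_huffman_ratio(text):
    # Treat each whitespace-separated word as one symbol and compare the
    # Huffman-coded size against 8 bits per character of the raw text.
    words = text.split()
    freqs = {}
    for w in words:
        freqs[w] = freqs.get(w, 0) + 1
    lengths = huffman_code_lengths(freqs)
    coded_bits = sum(lengths[w] for w in words)
    raw_bits = 8 * len(text)
    return float(raw_bits) / coded_bits if coded_bits else 0.0

The parts it leaves out - choosing phrases for the dictionary and storing a
very large dictionary compactly - are exactly where the hybrid part would
have to do the work.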