Subj : Re: Storing 20 million randomly accessible documents in compressed form
To   : comp.programming
From : Gerry Quinn
Date : Sun Sep 11 2005 12:18 pm

In article , mwojcik@newsguy.com says...
>
> In article , Gerry Quinn writes:
> > In article <3o5ji4F4ap5hU1@individual.net>, r124c4u102@comcast.net
> > says...
> > I don't know anything much about the subject, and the 30 does seem
> > high, but when you think about it there is huge redundancy in the
> > English language, so I believe it is indeed plausible for generalised
> > text (and I am in no doubt at all that (say) 10X compression is
> > feasible).
>
> With pure Huffman encoding? Maybe if the symbol set is words and
> punctuators rather than characters, across a sufficiently large
> sample, but I'm still a little dubious. (With a hybrid scheme of
> some sort rather than just Huffman, 1:10 seems quite plausible.)

That was my feeling too. I was thinking of a hybrid system, but the core
would be Huffman encoding on words and phrases. The best compression would
be approached in the limit as the dictionary gets very large.

But as I said, I couldn't find those 30:1 links again - maybe I had some
kind of brainstorm...

- Gerry Quinn
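
P.S. Just to sketch what I mean by the word-level core (illustrative
Python only - huffman_code_lengths and word_huffman_ratio are names I've
made up for the example, and it ignores the cost of storing the dictionary
itself, punctuation, and the spaces between words, so the ratio it reports
is optimistic):

import heapq

def huffman_code_lengths(freqs):
    # freqs maps symbol -> count; returns symbol -> code length in bits,
    # computed by the usual Huffman merge of the two least frequent groups.
    heap = [(count, i, [sym]) for i, (sym, count) in enumerate(freqs.items())]
    heapq.heapify(heap)
    lengths = dict((sym, 0) for sym in freqs)
    tie = len(heap)
    while len(heap) > 1:
        ca, _, group_a = heapq.heappop(heap)
        cb, _, group_b = heapq.heappop(heap)
        for sym in group_a + group_b:
            lengths[sym] += 1      # each merge pushes the group one level deeper
        heapq.heappush(heap, (ca + cb, tie, group_a + group_b))
        tie += 1
    return lengths

def word_huffman_ratio(text):
    # Treat each whitespace-separated word as one symbol and compare the
    # Huffman-coded size against 8 bits per character of the raw text.
    words = text.split()
    freqs = {}
    for w in words:
        freqs[w] = freqs.get(w, 0) + 1
    lengths = huffman_code_lengths(freqs)
    coded_bits = sum(lengths[w] for w in words)
    raw_bits = 8 * len(text)
    return float(raw_bits) / coded_bits if coded_bits else 0.0

The parts it leaves out - choosing phrases for the dictionary and storing a
very large dictionary compactly - are exactly where the hybrid part would
have to do the work.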