Subj : Re: Storing 20 million randomly accessible documents in compressed form
To   : comp.programming
From : mwojcik
Date : Fri Sep 09 2005 04:51 pm

In article , Gerry Quinn writes:

> In article <3o5ji4F4ap5hU1@individual.net>, r124c4u102@comcast.net
> says...
> > "Gerry Quinn" writes:
> >
> > > Compression ratios of 30 or so have been reported for English text,
> > > using schemes based on Huffman encoding.
> >
> > If you have a link handy I would be interested in reading about 30:1
> > with Huffman based coding.
>
> Search for lists of keywords like 'huffman encoding english text
> ratio' (without apostrophes) and you'll find various links.

Hmm.  I did that search and looked through the first couple of pages of
hits, but didn't find anything even of that order of magnitude.  Plenty
of sources quoted "3:1" or "30%", which is reasonable.

That's not to say there are no such results, of course - just that I
didn't find them.

> I don't know anything much about the subject, and the 30 does seem
> high, but when you think about it there is huge redundancy in the
> english language, so I believe it is indeed plausible for generalised
> text (and I am in no doubt at all that (say) 10X compression is
> feasible).

With pure Huffman encoding?  Maybe if the symbol set is words and
punctuators rather than characters, across a sufficiently large sample,
but I'm still a little dubious.  (With a hybrid scheme of some sort
rather than just Huffman, 10:1 seems quite plausible.)

Of course, "general English text" is a fairly vague term.  It's possible
that really high ratios were achieved using somewhat lossy
transformations - loss of formatting, regularization or even removal of
punctuation, case folding, etc. - or that the sample didn't include
things that some people might consider part of general text, such as
decimal numbers.

> I don't know the details, but I suspect commonly used words and
> phrases, perhaps even whole sentences, will have relatively short
> codes.

Using a symbol set other than (only) individual characters can certainly
improve compression.  Picking the optimal symbol set is a hard problem,
though (NP-complete, I'd guess offhand).  That's why Huffman coding is
usually implemented with a very simple set of fixed-length symbols
(typically bytes or characters), and better compression is achieved
using predictive encoders from the PPM family (to which BWT-based
compressors are closely related).

--
Michael Wojcik                             michael.wojcik@microfocus.com

Pogo: The dogs *scarcely* ever catches the rabbit.
Bun:  "Scarcely" got a very unpleasant ring of frequency to it.
   -- Walt Kelly
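
As a rough illustration of the character-level case discussed above,
here is a minimal sketch of order-0 (per-character) Huffman coding in
Python.  It is not from the thread; the sample string and function names
are made up for illustration, and the cost of storing the code table is
ignored.

import heapq
from collections import Counter

def huffman_code_lengths(text):
    # Build Huffman code lengths per character.  The lengths alone are
    # enough to measure compressed size; actual bit patterns aren't needed.
    freq = Counter(text)
    if len(freq) == 1:
        return {next(iter(freq)): 1}
    heap = [(f, i, {ch: 0}) for i, (ch, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees adds one bit to every code beneath them.
        merged = {ch: depth + 1 for ch, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

def ratio(text):
    # Assume 8 bits per input character; ignore the code-table overhead.
    freq = Counter(text)
    lengths = huffman_code_lengths(text)
    return 8.0 * len(text) / sum(freq[ch] * lengths[ch] for ch in freq)

sample = ("This is a short sample of ordinary English prose.  It is far "
          "too small to be a real corpus, so the number it prints is only "
          "a rough indication of what character-level Huffman coding "
          "achieves on general text.")
print("character-level Huffman ratio: %.2f:1" % ratio(sample))

On realistic English samples this sort of order-0 coder tends to land
somewhere around 1.7:1 to 2:1.  The "3:1" / "30%" figures mentioned
above generally involve a larger symbol alphabet or extra modelling in
front of the Huffman back end, which is consistent with 30:1 being out
of reach for the plain scheme.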