Subj : Re: Storing 20 million randomly accessible documents in compressed form
To   : comp.programming
From : mwojcik
Date : Wed Sep 07 2005 07:40 pm

In article <3o5ji4F4ap5hU1@individual.net>, "osmium" writes:

> "Gerry Quinn" writes:
>
> > Compression ratios of 30 or so have been reported for English text,
> > using schemes based on Huffman encoding.
>
> If you have a link handy I would be interested in reading about 30:1 with
> Huffman-based coding. That's very, very, very impressive. Did you mean to
> suggest that this applied to *generalized* text?

Shannon estimated around one bit per character as the limit for compressing
general English text, if memory serves. So all you need is a really verbose
original encoding. If your original document is only English text and it's
encoded in UCS-4, you could get 30:1 with a very good encoder.

However, I suspect Gerry's misremembering, or the results he refers to are
not for general text. The best current algorithms get down to around two
bits per character on the Calgary Corpus, for an overall ratio of about 4:1.

--
Michael Wojcik                       michael.wojcik@microfocus.com

Ten or ten thousand, does it much signify, Helen, how we date
fantasmal events, London or Troy?  -- Basil Bunting
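
For anyone who wants to check the arithmetic, here's a rough back-of-the-
envelope sketch (in Python). The bits-per-character figures are the
estimates quoted above, not measurements of mine; the ratio is just the
source encoding's bits per character divided by the compressed bits per
character.

    # Compression ratio = (bits/char in the source encoding) /
    #                     (bits/char achieved by the compressor).
    def ratio(source_bits_per_char, compressed_bits_per_char):
        return source_bits_per_char / compressed_bits_per_char

    # UCS-4 (32 bits/char) compressed to Shannon's ~1 bit/char estimate:
    print("UCS-4 at ~1 bit/char: %.0f:1" % ratio(32, 1.0))   # ~32:1

    # 8-bit text compressed to ~2 bits/char (best Calgary Corpus results):
    print("8-bit at ~2 bits/char: %.0f:1" % ratio(8, 2.0))   # 4:1

So a 30:1 figure is only plausible if you start from a very fat encoding
like UCS-4; from an 8-bit encoding the same entropy estimates put the
ceiling nearer 4:1.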