Subj : Re: Storing 20 million randomly accessible documents in compressed form
To   : comp.programming
From : mwojcik
Date : Wed Sep 07 2005 07:40 pm

In article <3o5ji4F4ap5hU1@individual.net>, "osmium" writes:

> "Gerry Quinn" writes:
>
> > Compression ratios of 30 or so have been reported for English text,
> > using schemes based on Huffman encoding.
>
> If you have a link handy I would be interested in reading about 30:1 with
> Huffman-based coding. That's very, very, very impressive. Did you mean to
> suggest that this applied to *generalized* text?

Shannon estimated around one bit per character as the limit for compressing
general English text, if memory serves. So all you need is a really verbose
original encoding. If your original document is only English text and it's
encoded in UCS-4, you could get 30:1 with a very good encoder.

However, I suspect Gerry's misremembering, or the results he refers to are
not for general text. The best current algorithms get down to around two
bits per character on the Calgary Corpus, for an overall ratio of about 4:1.

--
Michael Wojcik                       michael.wojcik@microfocus.com

Ten or ten thousand, does it much signify, Helen, how we date
fantasmal events, London or Troy?  -- Basil Bunting
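
For anyone who wants to check the arithmetic, here's a rough back-of-the-
envelope sketch (in Python). The bits-per-character figures are the
estimates quoted above, not measurements of mine; the ratio is just the
source encoding's bits per character divided by the compressed bits per
character.

    # Compression ratio = (bits/char in the source encoding) /
    #                     (bits/char achieved by the compressor).
    def ratio(source_bits_per_char, compressed_bits_per_char):
        return source_bits_per_char / compressed_bits_per_char

    # UCS-4 (32 bits/char) compressed to Shannon's ~1 bit/char estimate:
    print("UCS-4 at ~1 bit/char: %.0f:1" % ratio(32, 1.0))   # ~32:1

    # 8-bit text compressed to ~2 bits/char (best Calgary Corpus results):
    print("8-bit at ~2 bits/char: %.0f:1" % ratio(8, 2.0))   # 4:1

So a 30:1 figure is only plausible if you start from a very fat encoding
like UCS-4; from an 8-bit encoding the same entropy estimates put the
ceiling nearer 4:1.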