Subj: Re: Storing 20 million randomly accessible documents in compressed form
To: comp.programming
From: Gerry Quinn
Date: Tue Sep 06 2005 12:34 pm

In article <1125946156.818588.253910@g44g2000cwa.googlegroups.com>, gene.ressler@gmail.com says...
> If you can use a standard database product (like Adaptive Server
> Anywhere) that allows compressed databases, this problem disappears.
>
> If you can't, then your approach is reasonable (as is Jongware's). Not
> sure that with 200,000 zip files you will get 1.7GB on a CD, though.
> You'll need a compression ratio of well over 2 to 1. A single zip file
> gets about 1.8 to 1 for average text. Have you done any tests to see
> what kinds of results zip is getting on your data?

That seems very bad for text. I don't doubt a custom compressor could quite easily be written to get well over 2:1, especially for text in just one or a few languages and a limited number of symbols. Compression ratios of 30 or so have been reported for English text, using schemes based on Huffman encoding.

- Gerry Quinn
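[Editor's note: the test Gene suggests above, i.e. checking what zip-style compression actually achieves on your own documents, can be sketched with Python's standard zlib module, which implements DEFLATE (the algorithm used inside zip files). The function name and the sample text below are illustrative placeholders, not anything from the original thread; real document data will compress differently than this repetitive sample.]

```python
import zlib

def compression_ratio(data: bytes, level: int = 9) -> float:
    """Return original_size / compressed_size for DEFLATE at the given level."""
    compressed = zlib.compress(data, level)
    return len(data) / len(compressed)

if __name__ == "__main__":
    # Placeholder sample; substitute the bytes of a real document to
    # estimate whether the ~2:1 ratio needed to fit on the CD is reachable.
    sample = b"The quick brown fox jumps over the lazy dog. " * 200
    print(f"ratio for sample text: {compression_ratio(sample):.2f}:1")
```

Running this over a representative batch of the 200,000 documents would settle the question of whether plain zip clears the 2:1 threshold, or whether a custom text-specific scheme is needed.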