Subj : Re: Storing 20 million randomly accessible documents in compressed form
To   : comp.programming
From : M.Barren
Date : Mon Sep 05 2005 06:13 pm

Hi,

Thanks for all your replies.

>If you can't, then your approach is reasonable (as is Jongware's). Not
>sure that with 200,000 zip files you will get 1.7GB on a CD, though.
>You'll need a compression ratio of well over 2 to 1. A single zip file
>gets about 1.8 to 1 for average text. Have you done any tests to see
>what kinds of results zip is getting on your data?

For the time being I've appended all the documents one after the other into
one large XML file and zipped it. The zipped form is 182 MB compared to
1.7 GB!

>XML is typically very verbose. Use your favourite programming language to
>load the XML data and then marshal it to disc in binary form and back
>again. That is a custom compression algorithm that is likely to greatly
>outperform a general-purpose compressor like zip.

Good idea. I've decided to stop using XML for now and write the data in a
binary format, since all the documents follow a very simple structure.
Whenever I need the original XML, I can regenerate it from the binary data.

Having said that, I will still need to compress the binary data, since I
estimate it will be well above 650 MB. I will give the big zip file a try
and see whether it comes anywhere near 200 MB (because I have another 30 to
40 million documents on the way). I may be expecting magic from compression
technologies, but I just want to make sure I save as much space as I can,
since there's a little bit of competition to it as well.

Thanks again for your replies,

Michael
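
P.S. Here's roughly what I mean by writing each document in a simple binary
format and regenerating the XML on demand. Python just to illustrate, and the
field names (doc_id, title, body) are placeholders; my real structure is a bit
different, so treat this as a sketch rather than my actual code:

    import struct

    def pack_doc(doc_id, title, body):
        """Pack one document as: <u32 doc_id><u16 title len><title><u32 body len><body>."""
        t = title.encode("utf-8")
        b = body.encode("utf-8")
        return struct.pack("<IH", doc_id, len(t)) + t + struct.pack("<I", len(b)) + b

    def unpack_doc(buf, offset=0):
        """Inverse of pack_doc; returns (doc_id, title, body, next_offset)."""
        doc_id, tlen = struct.unpack_from("<IH", buf, offset)
        offset += 6
        title = buf[offset:offset + tlen].decode("utf-8")
        offset += tlen
        (blen,) = struct.unpack_from("<I", buf, offset)
        offset += 4
        body = buf[offset:offset + blen].decode("utf-8")
        offset += blen
        return doc_id, title, body, offset

    def to_xml(doc_id, title, body):
        """Regenerate the XML view on demand from the binary record."""
        return '<doc id="%d"><title>%s</title><body>%s</body></doc>' % (doc_id, title, body)

Length-prefixed strings instead of tags is where most of the saving over XML
comes from, before any compressor even gets involved.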
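
And since the whole point is random access, the plan for compression is to
compress the packed records in blocks and keep a small index of where each
block starts, so fetching one document only means decompressing its block, not
the whole archive. Again a rough Python sketch; the block size and file names
are made up, not final:

    import zlib

    DOCS_PER_BLOCK = 1000  # tuning knob: bigger blocks compress better but cost more per lookup

    def write_archive(docs, data_path="docs.bin", index_path="docs.idx"):
        """docs is an iterable of already-packed binary records (bytes)."""
        index = []          # (first_doc_number, byte_offset, compressed_length)
        block, first = [], 0
        with open(data_path, "wb") as out:
            for i, record in enumerate(docs):
                block.append(record)
                if len(block) == DOCS_PER_BLOCK:
                    comp = zlib.compress(b"".join(block), 9)
                    index.append((first, out.tell(), len(comp)))
                    out.write(comp)
                    block, first = [], i + 1
            if block:  # flush the last, partial block
                comp = zlib.compress(b"".join(block), 9)
                index.append((first, out.tell(), len(comp)))
                out.write(comp)
        with open(index_path, "w") as idx:
            for first, offset, length in index:
                idx.write("%d %d %d\n" % (first, offset, length))

    def read_block(block_entry, data_path="docs.bin"):
        """Seek to one block and decompress just that block."""
        first, offset, length = block_entry
        with open(data_path, "rb") as f:
            f.seek(offset)
            return zlib.decompress(f.read(length))

The trade-off versus one zip file per document is that I give up a little
granularity but get compression ratios much closer to the single big file.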