Subj : Re: Storing 20 million randomly accessible documents in compressed form
To   : comp.programming
From : [Jongware]
Date : Mon Sep 05 2005 06:11 pm

> I'm currently writing a program which deals with a massive 20 million
> or more very small xml documents. The program will only read (and not
> write or modify) the documents. I wanted to know what method I can use
> to compress these documents (if uncompressed, they occupy more than
> 1.7 GB) and at the same time be able to access them randomly by a
> unique name/code. I need them compressed since I want to put them on
> a CD (and not a DVD).
>
> My current idea is to zip each 100 or so of them together so that I can
> efficiently decompress only 100 to access one and not all 20 million.
>
> If you have any other ideas, please let me know
>
> cheers,
> Michael

1.7 GB over 20 million files amounts to about 85 bytes per file... Think about a general zip library. If you compress _all_ the files into one single big archive, the compression ratio is the best, but even a handful of large-ish archives is still far more economical than the 200 thousand(!) separate zips your 100-documents-per-archive idea would produce. Use the zip library to locate just the file you need inside the archive (and extract only that one to memory).

If locating a single file out of 20 million takes up too much time, build a separate index of the zipped file, read that first, and keep it in memory as long as possible. Even this index doesn't need to list all 20 million files individually if the unique name/code allows just about any kind of tree/sort.

[Jongware]
----
Everything I said was off the top of my head.
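
P.S. A rough sketch of the single-archive lookup I mean, in Python and untested; "docs.zip", DocStore and the member names are just placeholders for whatever naming you end up with:

  # Rough sketch only: one big zip archive, random read access by name.
  import zipfile

  class DocStore:
      """Read-only random access to many small documents in one zip."""
      def __init__(self, archive_path):
          # Opening the archive reads its central directory once;
          # keep the object alive so that lookup table stays in memory.
          self._zip = zipfile.ZipFile(archive_path, "r")

      def get(self, code):
          # Decompresses just this one member into memory,
          # not the whole archive.
          return self._zip.read(code)

  # usage:
  # store = DocStore("docs.zip")
  # xml_bytes = store.get("ABC12345.xml")

The zip central directory already plays the role of the in-memory index above; if even that is too big for 20 million entries, bucketing the documents into a few archives by a prefix of the name/code (the tree/sort idea) keeps each directory small.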