Subj : Re: Storing 20 million randomly accessible documents in compressed form
To   : comp.programming
From : M.Barren
Date : Wed Sep 07 2005 09:01 am

> I will give the BIG zip file a try to see if it compares anywhere near
> 200 MB (because I will have another 30 to 40 Million documents on the
> way). I might be expecting magic from compression technologies but I
> just want to make sure I will save as much space as I can since there's
> a little bit of competition to it as well.

Well, apparently the zip format is not capable of creating archives
containing more than 65,535 files (or #ziplib, which I use, has this
limitation). So if I'm not going to be able to use zip's archiving
ability, there's no reason to go for it, since other compression
algorithms perform much better than zip anyway.

> Basic star encoding is a simple dictionary compression technique, and
> if zip alone did not solve the OP's problem, star encoding followed
> by zip might well do so.

At first glance, star encoding just blew me away. It seemed like the
best solution. BUT the benchmark results (in the first link) weren't as
impressive as I thought they would be. With a bit more thought, I
figured it might not be appropriate for my program for several reasons,
some being as follows (a rough sketch of the idea is in the P.S. below):

1. The documents, even with the xml tags stripped, contain mostly
   unique words (names of places, addresses, etc.). So except for a
   very limited number of words such as Rd, St, Av, Pl, etc., the rest
   are mostly unique.

2. Because of reason [1], the star codes in the dictionary will grow
   longer than the words they represent (using variant B as explained
   in the link [1]), which in turn might hurt the compressibility of
   the data.

But since I haven't yet run any tests on the actual data to see how
star encoding performs, I can't rule it out of my choices.

Besides star encoding, the only other solution I can think of is to
append every 200 docs together, bzip them, and then append the zipped
packs into a single file with a simple address table for locating them
(also sketched in the P.P.S.). My only concern, which is unlikely to
come up in practice, is the performance hit if I ever need to access
more than 20 different docs spread across 20 different packs of 200.

Thanks again for your replies; they've helped me a lot so far.

Michael
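P.S. In case it helps anyone following the thread, here's a minimal
sketch (in Python, just because it's compact) of the
dictionary-substitution idea behind star encoding. This is a simplified
scheme I made up for illustration, NOT the exact variant B from the
benchmark link: dictionary words become '*' plus a hex rank, so the
transformed text is more repetitive and a back-end compressor like zip
or bzip2 does better on it.

# Simplified star-style transform (illustration only, not variant B).
# Assumes '*' never occurs in the raw text.

def build_codes(dictionary_words):
    # In practice you'd rank words by frequency; list order is the rank here.
    return {w: '*' + format(i, 'x') for i, w in enumerate(dictionary_words)}

def star_encode(text, codes):
    return ' '.join(codes.get(w, w) for w in text.split())

def star_decode(text, codes):
    inverse = {c: w for w, c in codes.items()}
    return ' '.join(inverse.get(t, t) for t in text.split())

codes = build_codes(['Road', 'Street', 'Avenue', 'Place'])
packed = star_encode('10 Main Street near Elm Avenue', codes)  # '10 Main *1 near Elm *2'
assert star_decode(packed, codes) == '10 Main Street near Elm Avenue'

Note how the codes grow with the dictionary: at 20 million entries the
hex rank alone is 7 characters, so a code ends up longer than "Rd" or
"St" -- which is exactly the problem in point [2] above.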
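P.P.S. And here's the pack-and-index idea sketched out (again in
Python), assuming the docs arrive as a list of byte strings. All the
names (PACK_SIZE, write_packs, read_doc, the file paths) are mine,
purely for illustration:

import bz2, struct

PACK_SIZE = 200  # docs per pack, as described above
ENTRY = struct.calcsize('<QI')  # index entry: 8-byte offset + 4-byte length

def write_packs(docs, data_path, index_path):
    # Compress docs in packs of PACK_SIZE; record (offset, length) per pack.
    with open(data_path, 'wb') as out, open(index_path, 'wb') as idx:
        for i in range(0, len(docs), PACK_SIZE):
            # Length-prefix each doc so the pack can be split after decompression.
            raw = b''.join(struct.pack('<I', len(d)) + d
                           for d in docs[i:i + PACK_SIZE])
            blob = bz2.compress(raw)
            idx.write(struct.pack('<QI', out.tell(), len(blob)))
            out.write(blob)

def read_doc(doc_id, data_path, index_path):
    # Fetch one doc: find its pack, decompress it, walk the length prefixes.
    pack_no, slot = divmod(doc_id, PACK_SIZE)
    with open(index_path, 'rb') as idx:
        idx.seek(pack_no * ENTRY)
        offset, length = struct.unpack('<QI', idx.read(ENTRY))
    with open(data_path, 'rb') as data:
        data.seek(offset)
        raw = bz2.decompress(data.read(length))
    pos = 0
    for _ in range(slot + 1):
        (size,) = struct.unpack_from('<I', raw, pos)
        doc, pos = raw[pos + 4:pos + 4 + size], pos + 4 + size
    return doc

docs = [('document %d' % i).encode() for i in range(500)]
write_packs(docs, 'docs.dat', 'docs.idx')
assert read_doc(423, 'docs.dat', 'docs.idx') == b'document 423'

With 20 million docs that's 100,000 packs, so the index is only about
1.2 MB (12 bytes per pack), and each lookup costs one read of the index
plus one seek-and-decompress of a single 200-doc pack.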