Subj : Re: Storing 20 million randomly accessible documents in compressed form
To   : comp.programming
From : M.Barren
Date : Wed Sep 07 2005 09:01 am

> I will give the BIG zip file a try to see if it compares anywhere near
> 200 MB (because I will have another 30 to 40 Million documents on the
> way). I might be expecting magic from compression technologies but I
> just want to make sure I will save as much space as I can since there's
> a little bit of competition to it as well.

Well, apparently the zip format is not capable of creating archives
containing more than 65,535 files (or #ziplib, which I use, has this
limitation). So if I'm not going to be able to use zip's archiving
ability, there's no reason to go for it, since other compression
algorithms perform much better than zip anyway.

> Basic star encoding is a simple dictionary compression technique, and
> if zip alone did not solve the OP's problem, star encoding followed
> by zip might well do so.

At first glance, star encoding just blew me away. It seemed like the
best solution. BUT the benchmark results (in the first link) weren't as
impressive as I thought they would be. With a bit more thought, I
figured it might not be appropriate for my program for several reasons,
some being as follows (a rough sketch of the idea is in the P.S. below):

1. The documents, even with the xml tags stripped, contain mostly
   unique words (names of places, addresses, etc.). So except for a
   very limited number of words such as Rd, St, Av, Pl, etc., the rest
   are mostly unique.

2. Because of reason [1], the star codes in the dictionary will grow
   longer than the words they represent (using variant B as explained
   in the link [1]), which in turn might hurt the compressibility of
   the data.

But since I haven't yet run any tests on the actual data to see how
star encoding performs, I can't rule it out of my choices.

Besides star encoding, the only other solution I can think of is to
append every 200 docs together, bzip them, and then append the zipped
packs into a single file with a simple address table for locating them
(also sketched in the P.P.S.). My only concern, which is unlikely to
come up in practice, is the performance hit if I ever need to access
more than 20 different docs spread across 20 different packs of 200.

Thanks again for your replies; they've helped me a lot so far.

Michael
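P.S. In case it helps anyone following the thread, here's a minimal
sketch (in Python, just because it's compact) of the
dictionary-substitution idea behind star encoding. This is a simplified
scheme I made up for illustration, NOT the exact variant B from the
benchmark link: dictionary words become '*' plus a hex rank, so the
transformed text is more repetitive and a back-end compressor like zip
or bzip2 does better on it.

# Simplified star-style transform (illustration only, not variant B).
# Assumes '*' never occurs in the raw text.

def build_codes(dictionary_words):
    # In practice you'd rank words by frequency; list order is the rank here.
    return {w: '*' + format(i, 'x') for i, w in enumerate(dictionary_words)}

def star_encode(text, codes):
    return ' '.join(codes.get(w, w) for w in text.split())

def star_decode(text, codes):
    inverse = {c: w for w, c in codes.items()}
    return ' '.join(inverse.get(t, t) for t in text.split())

codes = build_codes(['Road', 'Street', 'Avenue', 'Place'])
packed = star_encode('10 Main Street near Elm Avenue', codes)  # '10 Main *1 near Elm *2'
assert star_decode(packed, codes) == '10 Main Street near Elm Avenue'

Note how the codes grow with the dictionary: at 20 million entries the
hex rank alone is 7 characters, so a code ends up longer than "Rd" or
"St" -- which is exactly the problem in point [2] above.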
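P.P.S. And here's the pack-and-index idea sketched out (again in
Python), assuming the docs arrive as a list of byte strings. All the
names (PACK_SIZE, write_packs, read_doc, the file paths) are mine,
purely for illustration:

import bz2, struct

PACK_SIZE = 200  # docs per pack, as described above
ENTRY = struct.calcsize('<QI')  # index entry: 8-byte offset + 4-byte length

def write_packs(docs, data_path, index_path):
    # Compress docs in packs of PACK_SIZE; record (offset, length) per pack.
    with open(data_path, 'wb') as out, open(index_path, 'wb') as idx:
        for i in range(0, len(docs), PACK_SIZE):
            # Length-prefix each doc so the pack can be split after decompression.
            raw = b''.join(struct.pack('<I', len(d)) + d
                           for d in docs[i:i + PACK_SIZE])
            blob = bz2.compress(raw)
            idx.write(struct.pack('<QI', out.tell(), len(blob)))
            out.write(blob)

def read_doc(doc_id, data_path, index_path):
    # Fetch one doc: find its pack, decompress it, walk the length prefixes.
    pack_no, slot = divmod(doc_id, PACK_SIZE)
    with open(index_path, 'rb') as idx:
        idx.seek(pack_no * ENTRY)
        offset, length = struct.unpack('<QI', idx.read(ENTRY))
    with open(data_path, 'rb') as data:
        data.seek(offset)
        raw = bz2.decompress(data.read(length))
    pos = 0
    for _ in range(slot + 1):
        (size,) = struct.unpack_from('<I', raw, pos)
        doc, pos = raw[pos + 4:pos + 4 + size], pos + 4 + size
    return doc

docs = [('document %d' % i).encode() for i in range(500)]
write_packs(docs, 'docs.dat', 'docs.idx')
assert read_doc(423, 'docs.dat', 'docs.idx') == b'document 423'

With 20 million docs that's 100,000 packs, so the index is only about
1.2 MB (12 bytes per pack), and each lookup costs one read of the index
plus one seek-and-decompress of a single 200-doc pack.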