Subj : Re: Storing 20 million randomly accessible documents in compressed
To   : comp.programming
From : Joe Wright
Date : Wed Sep 07 2005 09:48 pm

Michael Wojcik wrote:
> In article <431e7f8c$0$17734$afc38c87@news.optusnet.com.au>,
> "DarkD" writes:
>
>> "Gene" wrote in message
>> news:1125946156.818588.253910@g44g2000cwa.googlegroups.com...
>>
>>> A single zip file gets about 1.8 to 1 for average text.
>>
>> 1.8 to 1? I think you are thinking of the ratio for random ASCII
>> display characters. Typical compressed books etc. have a huge ratio
>> of about 30:1.
>
> They do not, unless the source representation is extremely bloated.
>
> I just did a couple of tests with large, highly redundant ASCII
> documents (the Perl 5 change log, for example) and gzip -9 just to
> confirm, and didn't see anything better than about 5:1.
>
> If you believe otherwise, cite a source.

Hi Michael,

I have a folder of 392 program files (*.c) totaling 366,692 bytes.
Using my favorite zipper:

  pkzip x.zip *.c

I find x.zip to be 191,539 bytes. That's about 1.91:1 compression on
text files. Another test, of a large .dbf table, did better:

  08/22/2005  05:03     59,317,319  MBRS.DBF
  09/07/2005  20:21      6,141,204  MBRS.ZIP

...for a 9.66:1 ratio. The .dbf format is rich in space characters, and
the zip algorithms are very efficient at squeezing out oft-repeated
bytes. 10:1 is about the best compression I have seen, so somewhere
between 2:1 and 10:1 seems rational in my small world. I would need a
lot more information before I could believe a 30:1 ratio.
--
Joe Wright
"Everything should be made as simple as possible, but not simpler."
--- Albert Einstein ---
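For anyone who wants to reproduce this kind of measurement in code
rather than with pkzip, here is a minimal sketch using zlib's one-shot
compress(). The file name ratio.c and the build line are illustrative
only, and it assumes zlib is installed; since zip and gzip use the same
deflate algorithm underneath, the ratios should land in the same
ballpark as the figures above.

  /* ratio.c -- measure a deflate compression ratio for one file
   * using zlib's one-shot compress().
   * Build (assuming zlib is installed):  cc ratio.c -lz -o ratio
   */
  #include <stdio.h>
  #include <stdlib.h>
  #include <zlib.h>

  int main(int argc, char **argv)
  {
      FILE *fp;
      long n;
      uLongf dlen;
      unsigned char *src, *dst;

      if (argc != 2) {
          fprintf(stderr, "usage: %s file\n", argv[0]);
          return 1;
      }
      fp = fopen(argv[1], "rb");
      if (fp == NULL)
          return 1;

      /* Slurp the whole file into memory. */
      fseek(fp, 0L, SEEK_END);
      n = ftell(fp);
      rewind(fp);
      if (n <= 0)
          return 1;
      src = malloc((size_t)n);
      dlen = compressBound((uLong)n);   /* worst-case output size */
      dst = malloc(dlen);
      if (src == NULL || dst == NULL)
          return 1;
      if (fread(src, 1, (size_t)n, fp) != (size_t)n)
          return 1;
      fclose(fp);

      /* One-shot deflate at zlib's default compression level. */
      if (compress(dst, &dlen, src, (uLong)n) != Z_OK)
          return 1;

      printf("%ld -> %lu bytes, %.2f:1\n",
             n, (unsigned long)dlen, (double)n / (double)dlen);
      free(src);
      free(dst);
      return 0;
  }

Run it over a .c file and over a blank-padded .dbf and you should see
roughly the same 2:1 versus 10:1 spread reported above.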