Subj : Re: Storing 20 million randomly accessible documents in compressed form
To   : comp.programming
From : Alex Fraser
Date : Thu Sep 08 2005 11:22 am

"Joe Wright" wrote in message news:1YqdnZ1EH_f7F4LeRVn-2Q@comcast.com...
> Michael Wojcik wrote:
> > In article <431e7f8c$0$17734$afc38c87@news.optusnet.com.au>, "DarkD"
> > writes:
> >> "Gene" wrote in message
> >> news:1125946156.818588.253910@g44g2000cwa.googlegroups.com...
> >>
> >>> A single zip file gets about 1.8 to 1 for average text.
> >>
> >> 1.8 to 1? I think you are thinking of the ratio for random ASCII
> >> display characters. Typical compressed books etc. have a huge
> >> ratio of about 30:1.
> >
> > They do not, unless the source representation is extremely bloated.
> >
> > I just did a couple of tests with large, highly-redundant ASCII
> > documents (the Perl 5 change log, for example) and gzip -9 just to
> > confirm, and didn't see anything better than about 5:1.
> >
> > If you believe otherwise, cite a source.
>
> I have a 'folder' of 392 program files (*.c) comprising 366,692
> bytes. Using my favorite zipper..
>
> pkzip x.zip *.c
>
> I find x.zip to be 191,539 bytes. That's about 1.91:1 compression
> on text files.

Each file is independently compressed, and the average size is under
1KB. If you concatenated the files, you would probably (depending
partly on the compression settings) get significantly better
compression. 4:1 would not be unusual if there are relatively few
bytes in comments/string literals.
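If you want to try this on your own source tree, here is a quick
sketch (Python, purely for brevity; any language with a zlib binding
would do, and it assumes the files are in the current directory). It
compares deflate applied to each file individually against deflate
applied to the concatenation:

import glob
import zlib

# Assumes the *.c files sit in the current directory. zlib's deflate
# is the same algorithm pkzip and gzip use, though this ignores the
# per-member archive headers, which make the per-file case slightly
# worse still in a real zip.
data = [open(name, "rb").read() for name in sorted(glob.glob("*.c"))]

total = sum(len(d) for d in data)

# Each file compressed on its own, as zip does per archive member.
per_file = sum(len(zlib.compress(d, 9)) for d in data)

# All files compressed as one stream, so deflate's 32KB window can
# exploit redundancy across file boundaries.
solid = len(zlib.compress(b"".join(data), 9))

print("uncompressed: %d bytes" % total)
print("per-file:     %d bytes (%.2f:1)" % (per_file, float(total) / per_file))
print("concatenated: %d bytes (%.2f:1)" % (solid, float(total) / solid))

This is essentially the difference between a zip archive and a
"solid" archive like tar.gz: deflate only looks back 32KB, so with
~1KB files compressed individually it never gets the chance to reuse
matches from the other 391.

Alex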