Newsgroups: comp.os.minix
Path: utzoo!henry
From: henry@utzoo.uucp (Henry Spencer)
Subject: Re: compression
Message-ID: <1989Jul18.174647.19537@utzoo.uucp>
Organization: U of Toronto Zoology
References: <2888@ast.cs.vu.nl>
Date: Tue, 18 Jul 89 17:46:47 GMT

In article <2888@ast.cs.vu.nl> ast@cs.vu.nl (Andy Tanenbaum) writes:
>I wonder if better compression of C programs is possible...
>sort of like libpack.c does, only dynamically instead of using fixed strings.
>... It is my suspicion that such a program could compress
>better than a factor of 2 on C programs.

Andy, I just ran some quick tests using some C-analysis stuff I've got,
and I doubt that a simple approach will give you more than a factor of 2-3.
I ran a few large C programs through a tokenizer (one which retains white
space), and counted both the number of tokens (approximating the number of
output codewords, ignoring limits on codeword size) and the size of the
output after "sort -u" (approximating the size of the codeword dictionary).
This is actually an optimistic estimate, because it ignores the limits on
codeword size and because my tokenizer essentially eliminates comments.  Best
case was about a factor of 3.  A quick look at eliminating all white space
(i.e. we assume a C-specific compressor whose decompressor includes a
paragrapher) suggests that this might perhaps get it to a factor of 4 in
favorable cases.  All in all, it doesn't seem a promising approach.
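
For anyone who wants to repeat the estimate, here is a rough sketch of the
measurement described above.  It is not the C-analysis code used for the
numbers in this post.  The token pattern, the names tokenize and
estimate_ratio, the 2-byte codeword, and the one-separator-per-entry
dictionary cost are all assumptions made for illustration.

```python
import re

# Hypothetical, deliberately crude C token pattern (an assumption, not the
# tokenizer used for the figures above).  Alternatives are tried in order.
TOKEN_RE = re.compile(r"""
    /\*.*?\*/                 # a complete comment
  | \s+                       # a run of white space
  | [A-Za-z_]\w*              # identifier or keyword
  | \d[\w.]*                  # number, roughly
  | "(?:\\.|[^"\\])*"         # string literal
  | '(?:\\.|[^'\\])*'         # character literal
  | .                         # any other single character
""", re.VERBOSE | re.DOTALL)


def tokenize(source, keep_whitespace=True):
    """Split C source into tokens, dropping comments.

    White-space runs are kept as tokens unless keep_whitespace is False,
    which models a decompressor that re-paragraphs the code itself.
    """
    tokens = []
    for match in TOKEN_RE.finditer(source):
        tok = match.group(0)
        if tok.startswith("/*"):
            continue                      # comments are eliminated
        if tok.isspace() and not keep_whitespace:
            continue
        tokens.append(tok)
    return tokens


def estimate_ratio(source, codeword_bytes=2, keep_whitespace=True):
    """Estimate original size / (codeword stream + dictionary).

    One fixed-size codeword per token approximates the output stream; the
    distinct tokens, one per line as "sort -u" would print them, approximate
    the dictionary.  codeword_bytes=2 is an assumed value.
    """
    tokens = tokenize(source, keep_whitespace)
    dictionary = set(tokens)
    dict_bytes = sum(len(t) + 1 for t in dictionary)   # +1 for a separator
    compressed = len(tokens) * codeword_bytes + dict_bytes
    return len(source) / compressed
```

Calling estimate_ratio(text) and estimate_ratio(text, keep_whitespace=False)
on the contents of a large C file gives the two cases discussed above.  Like
the original test, the estimate is optimistic, since comments are counted in
the input size but never stored in the output.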
-- 
$10 million equals 18 PM       |     Henry Spencer at U of Toronto Zoology
(Pentagon-Minutes). -Tom Neff  | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
