This is a basic description of how this program works.

mhtml -- a program to mirror html pages recursively
Copyright (C) 1996  Kevin M. Bealer

This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version
2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public
License along with this program; if not, write to the Free
Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139,
USA.

You can send mail to the author at <kmb203@psu.edu> or:

Kevin M Bealer
94 Bowers Road
Mertztown, PA 19539




Now we have the engine that gets the first page, and all the
parts it uses.

The next step:  to translate the references in the file into
local addresses, building a list of the other files it refers
to as we mosey along.

This can be done with a flex lexer.

We translate the file, building a database as we go along.  The
database will be a list of type entry, where entry contains
an int and a String.  The String stores the reference and the
int stores the type, which is based on what sort of tag it was
extracted from (i.e. background, inline image, or a_href).  This
list can then be sorted and duplicate entries dropped.
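
As a rough illustration of the entry list described above, here is
a minimal Python sketch.  It is not the flex lexer itself:  simple
regular expressions stand in for it, and they only handle
double-quoted attributes.  The names Entry, collect_entries and the
numeric type codes are assumptions made for this example.

```python
import re
from typing import List, NamedTuple

# Hypothetical type codes; the notes only say the int records the tag kind.
BACKGROUND, INLINE_IMAGE, A_HREF = 0, 1, 2


class Entry(NamedTuple):
    kind: int  # which sort of tag the reference was extracted from
    ref: str   # the reference exactly as written in the page


# One simplified pattern per tag kind (stand-ins for lexer rules).
PATTERNS = [
    (BACKGROUND, re.compile(r'<body\b[^>]*?\bbackground\s*=\s*"([^"]*)"', re.I)),
    (INLINE_IMAGE, re.compile(r'<img\b[^>]*?\bsrc\s*=\s*"([^"]*)"', re.I)),
    (A_HREF, re.compile(r'<a\b[^>]*?\bhref\s*=\s*"([^"]*)"', re.I)),
]


def collect_entries(html: str) -> List[Entry]:
    """Return the sorted list of unique (kind, ref) entries found in html."""
    found = [
        Entry(kind, match.group(1))
        for kind, pattern in PATTERNS
        for match in pattern.finditer(html)
    ]
    # set() drops duplicates; sorted() orders by kind, then by reference.
    return sorted(set(found))
```

Sorting by the type first keeps all references of one kind
together, which makes the later selection pass a simple walk over
the list.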

Then we go through the list and use some criteria to select
which entries get what behavior.

1. ignore (throw away the entry)
2. stub (if the entry is not in the cache, make a stub that
	points to the "real" entry "out there")
3. fetch (download the entry and stick it in the cache)
4. mirror (download the entry and run "mhtml" on it)

In the case of mirror, the recursive invocation of mhtml could
run with a decremented hop count.  In effect, this means that
the mirror will only fetch documents that can be reached with
(n) hops or fewer.  (n) will be given a default value if
unspecified.
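
The hop count can be sketched as a recursive function.  This is
only an illustration in Python, not the mhtml code:  links_of is a
hypothetical callback that downloads a page and returns the
references found in it, DEFAULT_HOPS is an assumed default (the
notes do not give a value), and the bookkeeping dictionary is an
addition so that a page first reached by a long path is still
expanded when a shorter path to it turns up.

```python
from typing import Callable, Dict, Iterable, Optional

DEFAULT_HOPS = 2  # assumed default; the notes do not specify one


def mirror(url: str,
           links_of: Callable[[str], Iterable[str]],
           hops: int = DEFAULT_HOPS,
           best: Optional[Dict[str, int]] = None) -> Dict[str, int]:
    """Visit url and everything reachable from it within `hops` hops.

    Returns a dict mapping each visited URL to the largest number of
    hops that were still available when it was reached.
    """
    if best is None:
        best = {}
    # Skip pages already visited with at least this many hops to spare.
    if best.get(url, -1) >= hops:
        return best
    best[url] = hops
    if hops > 0:
        for link in links_of(url):
            # The recursive call runs with a decremented hop count.
            mirror(link, links_of, hops - 1, best)
    return best
```

With a hop count of 0 only the starting page is visited; each
extra hop widens the mirror by one level of links.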

A fair default behavior would be to fetch all inlined images
and make stubs for all hrefs, without fetching or mirroring any
of the linked pages.
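
One way to picture the selection step with that default is the
Python sketch below.  The Action names follow the four behaviors
listed above; the numeric type codes and the function name
default_action are assumptions made for this example.  The notes
do not say what the default should do with backgrounds, so this
sketch treats them like inlined images.

```python
from enum import Enum


class Action(Enum):
    IGNORE = 1  # throw away the entry
    STUB = 2    # point at the "real" entry if it is not in the cache
    FETCH = 3   # download the entry into the cache
    MIRROR = 4  # download the entry and run mhtml on it


# Hypothetical type codes for the int stored in each entry.
BACKGROUND, INLINE_IMAGE, A_HREF = 0, 1, 2


def default_action(kind: int) -> Action:
    """Map an entry's type code to the default behavior."""
    if kind == INLINE_IMAGE:
        return Action.FETCH
    if kind == A_HREF:
        return Action.STUB
    if kind == BACKGROUND:
        # Assumption: not covered by the notes; handled like an image.
        return Action.FETCH
    return Action.IGNORE
```

Other policies, such as mirroring hrefs that stay on the same
host, would replace this one function without touching the rest.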
