Presenting version 4.0 of urlmon, a URL monitor.

PLEASE read this file, at least until the line that says:  "1. 
Description".  There is one WARNING and one NOTE that you _must_ read. 
Versions 3.x and higher are not completely compatible with earlier 
versions, so you need to know what to do.  

Written by Jeremy Impson <jdimpson@acm.org>

with suggestions, bug reports/fixes, help, and/or code from:
  Jeff Lightfoot	<jeffml@pobox.com>	
  Markus Kohler		<kohlerm@betze.bbn.hp.com>
  Fred Maciel		<fred-m@crl.hitachi.co.jp>
  Jauder Ho		<jauderho@transmeta.com>
  Michael Wiedmann	<mw@miwie.in-berlin.de>
  Ted Serreyn		<tserreyn@pop.globaldialog.com>
  Eric Raymond		<esr@thyrsus.com>
  Bill Dyess		<bill@dyess.com>
  Robin Houston		<robin@oneworld.org>
  Peter Mardahl		<peterm@langmuir.EECS.Berkeley.EDU>
  Robert Richard George 'reptile' Wal
			<reptile@reptile.eu.org>

o See the file CHANGELOG.txt for version history and a list of new features
  for this release.

o See the file FILTERS.txt to see how to use the filtering capability of
urlmon. 

o See the file URLMONRC.txt for a full description of the urlmonrc format

o See the file MODULES.txt for information on getting the right Perl
Modules in order to run urlmon.

o See section 'i.  Parallel URL monitoring' of this file for information
on the optional SystemV IPC for intercommunication between the parallel
urlmon children (as opposed to the default method, writing out temporary
files), among other things.

*-*-*>>WARNING<<*-*-*

For versions 3.0 and higher, the format of the urlmonrc (aka last
modified database) file has changed.  Recent versions _WILL_ convert it for
you, but previous versions of urlmon WILL CHOKE VIOLENTLY on the new
format, very likely destroying all your data.

BACK UP YOUR urlmonrc FILE IF YOU ARE UPGRADING FROM ANYTHING PRIOR TO
VERSION 3.0 BEFORE RUNNING VERSION 3.0 !!!!!!!!!!!!!!!!!!!!!!!!!!!

NOTE:NOTE:NOTE:NOTE:NOTE:NOTE:NOTE:NOTE:NOTE:NOTE:NOTE:NOTE:NOTE:NOTE
Also, run the old version of urlmon once before running the new version. 
I have tried my hardest to avoid any false positives during the upgrade
process, but I can't be sure.  By running the old version immediately
before the new one, any modifications reported by the old version are
real ones, and any reported by the new version are false ones (unless a
URL actually changed in the time between the two runs.  Unlikely, but
possible.)


If you are using urlmon for the first time, please pardon my shouting.
None of this applies to you :)


1. Description

urlmon was written by yours truly after I became interested in certain
episodic web sites (kind of like the old radio serials that played back in
the day...) 

I got tired of constantly checking up on them, to see if the next issue
had come out.  I don't like to do too much web-surfing, as there are more
important things in my life (like writing neat perl scripts).  So I wrote
urlmon. 

urlmon makes a connection to a web site and records the last_modified
time for that URL.  On subsequent runs, it will check the URL again,
this time comparing the new information to the previously recorded time. 
(Note that if the subsequent time is older (less) than the first, urlmon
will still assume that the URL has been updated.  I figured I'd play it
safe.)  Since the last_modified data is optional and not every HTTP
server provides it, urlmon will fall back to taking an MD5 checksum of
the contents.  This is actually more accurate, as timestamps can be
faked or inaccurate.  According to the mathematicians, changing a string
without changing its MD5 signature is HARD, and to do so would require
drastic changes to the string.  For that reason, there is an option to
force checksums to be used instead of timestamps, though this is
computationally more intensive. 
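The idea can be sketched in shell (purely illustrative; urlmon does this
internally in Perl with the MD5 module, and the page contents below are
made up):

```shell
# Illustrative sketch of the checksum comparison urlmon performs.
# The two strings stand in for the body of a URL fetched on two
# different days.
old_body='<html>issue 1</html>'
new_body='<html>issue 2</html>'

old_sum=$(printf '%s' "$old_body" | md5sum | cut -d' ' -f1)
new_sum=$(printf '%s' "$new_body" | md5sum | cut -d' ' -f1)

if [ "$old_sum" != "$new_sum" ]; then
  status=CHANGED      # contents differ, so the checksums differ
else
  status=UNCHANGED
fi
echo "$status"        # prints "CHANGED" for the bodies above
```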

New with version 3.0, you can specify filters to be applied to the
contents of a URL, so that a certain amount of formatting (a lot of
formatting, if you are crafty) can be done to it before the checksum is
computed.  In this way you can remove rotating advertisements and other
dynamic data that would otherwise cause the checksum to report that the
URL has changed.  (Because it has, although likely nothing significant
has been added to the page, especially if it's an advertisement.)  See
the file FILTERS.txt for further explanation.

If possible, urlmon will request only the headers (HEAD) from each URL,
and use them to decide if the URL has changed.  It will look for
last_modified info; if it is there, urlmon can tell whether the URL has
changed.  If there is no last_modified info in the headers (or if '-s' is
supplied on the command line, which forces the use of checksums, or if
you are using a content-filter), it will request all the data (GET) and
compute a checksum.  This saves bandwidth by requesting as little data as
possible.  However, if urlmon already has a checksum in its database for
that URL, indicating that no last_modified info was available previously
(or that a content-filter is in use), it will request all the data (GET)
at the outset, to minimize the number of network connections.
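A minimal sketch of that decision (the two variables stand in for what a
HEAD request and the database would report; they are assumptions for
illustration, since the real work happens inside urlmon via LWP):

```shell
# Sketch of urlmon's HEAD-vs-GET decision.  $last_modified stands in
# for a Last-Modified header returned by a HEAD request, and
# $db_has_checksum for what the urlmonrc database says about this URL.
last_modified=''        # empty: the server sent no Last-Modified header
db_has_checksum=yes     # a previous run already fell back to checksums

if [ "$db_has_checksum" = yes ]; then
  method=GET            # the body is needed to checksum it anyway
elif [ -n "$last_modified" ]; then
  method=HEAD           # the timestamp in the headers is enough
else
  method=GET            # no timestamp: fetch the body and checksum it
fi
echo "$method"          # prints "GET" with the values above
```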

urlmon is able to do similar monitoring of FTP sites, because it uses the
LWP module, which provides a more or less uniform interface to HTTP, FTP,
and NNTP.  (urlmon won't do NNTP yet, and probably won't, unless someone sends
me a patch :)  urlmon can also check local files by using the file:/path/to/file
or file://localhost/path/to/file URL format.


2. Requirements

urlmon requires perl5 or greater (but what could be greater than Perl?),
the MD5 perl module, and LWP (Library for WWW access in Perl, aka
libwww), which in turn requires the MIME-Base64 (in the MIME directory), 
HTML-Parser (in the HTML directory), MD5 (in the MD5 directory), and 
libnet (in the NET directory) modules.  These can
all be found at any CPAN archive (check out http://www.perl.com/perl/CPAN/
in the modules section for the mirror nearest you!).  It also needs
Getopt.pm and ctime.pl, but those should have come with your perl
distribution. 

See the file MODULES.txt for an in-depth discussion of how to get and
install the necessary Perl Modules.

You'll need to install these somewhere so that your script can 'use' or
'require' them (see the perl man pages for '@INC' for more info).  However,
the installation scripts that accompany the perl modules should take care
of this for you.

And of course, you'll need some type of networking set up so that you can
access the web servers you want to keep tabs on. 

Finally, you'll need some method to invoke urlmon.  It can be run at the
command line, but it is useful to make crontab entries (do a 'man crontab'
on most Unix variants, or 'man cron' if that doesn't work) for urlmon, and
have the results sent to you via email.  See the sample scripts in section
5.
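For example, a crontab entry along these lines (the path and times are
hypothetical; adjust to taste) would run urlmon twice a day, and cron
will mail the command's output to the crontab's owner by default:

```shell
# hypothetical crontab entry: run urlmon at 8:00 and 20:00 every day;
# -c prints only changed/new URLs, -l checks everything in the database
0 8,20 * * * /usr/local/bin/urlmon -cl
```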


3. Usage

Like any good UNIX command line utility, urlmon has quite a few options. 
First, here's a summary of its options.  What follows is a better
explanation of the uses of the command-line options. 

Invoking 'urlmon -h' yields this:

 
Usage: urlmon [-Cdcso] [-P http://my.proxy.org] [-f urlmonrcfile]  [-t timeout] [-F procs] [ <URL> | -l | -h | -p | -r <URL> ]
  <URL> is one or more space-delimited URL's to monitor (or remove).
  -l use the URLs in last_modified database (.urlmonrc) as targets to check.  (Also uses URLs on command line.)
  -h print this stuff.
  -r remove the URLs from database.  Silent if URL is not in database.
  -s use only checksums to determine if URLs have been modified.
  -d run in debug mode.
  -c curt output.  Print nothing but the changed or new URLs.  Shuts off -d.  Prints errors on STDERR.
  -p print out contents of the last_modified database (.urlmonrc).
  -f <file>  use supplied file as last_modified database.
  -C assume that argument(s) will be url/comment pair(s).
  -P specifies a proxy server to use for ftp, gopher, and http connections.
  -t make 'timeout' the timeout length for all network connections. Defaults to 180 seconds.
  -F use 'procs' number of processes to monitor in parallel.  Defaults to 1.
  -o save all new files to the current directory.
  Notes: 
    For ftp and http connections, the following syntax is allowed for authenticated sessions:
      ftp://username:passwd@host.domain.org/  
      http://username:passwd@host.domain.org/
    The following environmental variables will be searched for proxy servers:
      http_proxy, gopher_proxy, wais_proxy, and no_proxy

a. Basic usage.

The normal invocation is along the lines of

	urlmon http://url.one.foo http://url.two.foo ...

or

	echo http://url.one.foo | urlmon

When feeding URLs to urlmon's standard input, put only one URL per
line.
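For instance, a space-separated list of URLs can be turned into the
one-per-line form with tr (the URLs here are placeholders):

```shell
# convert a space-separated list of URLs into one-per-line form,
# suitable for piping into urlmon's standard input
urls='http://url.one.foo http://url.two.foo'
lines=$(printf '%s' "$urls" | tr ' ' '\n')
echo "$lines"
# each URL is now on its own line, ready for:  echo "$lines" | urlmon
```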

These will cause urlmon to look up each URL and generate checkstamps.  A
checkstamp is either a timestamp or a checksum, either of which can be
used to determine when a file was last updated.  It will then look in the
database file to see if the URLs are there.  If not (i.e.  urlmon has
never been invoked on those URLs before), it will report that they are new
(unless -c is given, see below) and add them to the database.  If they are
in the database, the newly generated checkstamp is compared to the
checkstamp recorded in the database.  If they are different, urlmon
reports that the URL has changed.  urlmon will also report any errors it
encounters. 

(Throughout this document, I use the word "checksum" to mean the result
of applying any content-based filter, not just the checksum filter.  
See FILTERS.txt for more information.)

Invoking 'urlmon -r URLs' will remove the URLs from the database (and
will be silent if any are not present in the database.  I know, I'm just
lazy!).  And 'urlmon -rl' WILL delete the entire database, so don't do it
unless you mean it. 


b. Dealing with the URL database.

See the file URLMONRC.txt for a full description of the URL database.

(The following also appears in the file URLMONRC.txt, but since it's
pretty important, it should be in this, the official "README".)

	Invoking urlmon on the urlmonrc database
	----------------------------------------

	Invoking urlmon as listed in the section entitled 'Basic usage'
	of the README.txt file is a good way (well, the only way :) to
	add entries to the database.  But the power comes with the '-l'
	switch. This will cause urlmon to read in the URLs in its
	database file and treat them as though they were listed on the
	command line a la 'Basic usage'. In its simplest, the script

	        urlmon -l | mail userid@host.foo

	executed regularly will keep 'userid' aware of the status of the
	URLs in his or her URL database.  However, it will also cause
	mail to be sent even if there are no URLs changed, so something
	slightly more complicated is needed. (See section 5 for such a
	script.) 


c. Scripting

I debated putting an 'executable' flag in urlmon, which would be set to
some application to be executed for every URL that was updated or
added.  But then I realized that it would be more useful to make urlmon
conducive to scripting, rather than making it try to do everything itself. 
Therefore, giving it a '-c' option will cause it to print nothing but the
URLs that have changed or are new, one URL per line.  Any errors are
printed on STDERR.  This option unsets '-d', described next. 
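As a sketch, a script can then walk that output line by line; here a
printf stands in for an actual 'urlmon -cl' run, and the URLs are
placeholders:

```shell
# Sketch of scripting against urlmon's curt (-c) output.  The printf
# stands in for `urlmon -cl`, which prints one changed/new URL per line.
changed=$(printf 'http://url.one.foo\nhttp://url.two.foo\n')

count=0
for url in $changed; do
  count=$((count + 1))
  # a real script might fetch or archive each one at this point
  echo "changed: $url"
done
echo "$count URLs changed"   # prints "2 URLs changed"
```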

d. Debug info

An argument of '-d' will cause urlmon to print out what it is doing, every
step of the way.  This is useful if it's not behaving the way you expect. 
The messages it prints are (in my subjective opinion) fairly descriptive,
but a perusal of the source code will bring further enlightenment (as
perusing source code usually does). 

If you have any bugs to report, send along an example of the bug with the
debugging turned on.

e. Fallibility

As explained before, the timestamp information provided by web servers
could be inaccurate or, worse, faked (although why anyone would want to
do that, I don't know).  An option of '-s' will cause urlmon to ignore
timestamp information and generate and store checksums instead.  When
urlmon compares its database info to the new information from the web
server, if the web server passed last_modified info but the database has
a checksum, urlmon will ignore the last_modified info and do a checksum
comparison.  This way, someone who wants to always be accurate can
specify '-s' once for every URL, and always get accurate information. 

I've had one weird case where, when doing HEAD requests, the web server
(one of Netscape's, I _think_) wouldn't send the last_modified time, but
it would when it received a GET request.  In this case, urlmon will decide
that since there is no last_modified time when it does its initial (HEAD)
request, it must use checksums (and do GET requests).  I'm not going to
adapt urlmon for this case, as I think it is just too stupid.

Using filters can cause a bit of confusion at first.  When applying a
filter to a URL for the first time, urlmon will report that the URL has
changed even if it hasn't.  Also, if you change the type of filter you
use (see FILTERS.txt), it is very likely that urlmon will think there is
a change.

f. Comments in urlmonrc file

See the file URLMONRC.txt

g. Proxying

urlmon now utilizes proxying.  It does so in two ways.  You can either
set the following environmental variables like so (from the
LWP::UserAgent man page, which is utilized by urlmon): 

         gopher_proxy=http://proxy.my.place/
         wais_proxy=http://proxy.my.place/
         no_proxy="my.place"
         export gopher_proxy wais_proxy no_proxy

       Csh or tcsh users should use the setenv command to define
       these environment variables.

(The man page doesn't mention this, but http_proxy seems to work, too.)

Note that if these environmental shell variables are set, they will be
used automatically.

The other way is to explicitly specify the proxy server with the '-P'
option, like this:

  urlmon -P http://proxy.my.place/ http://url.to.monitor ...

This will override the variables set in the environment.  Note also that
this proxy will be used for all ftp, http, and gopher connections.

h.  Timeouts

The default amount of time urlmon will wait on a connection is 3 minutes.
You can change this by giving a '-t n' command-line switch, where 'n' is
the time in seconds.

i.  Parallel URL monitoring

urlmon can now monitor URLs in parallel: on start-up, it forks off a
certain number of copies of itself, and each copy monitors a portion of
all the URLs specified for monitoring.  The number of copies to be made
is specified by the '-F n' flag, where n is the number of copies.  Each
copy will receive approximately 1/n of the URLs to be monitored.  If no
'-F' is given, the default is one copy.  (Actually, in that case no
copies are forked, and the original instance does the monitoring.  In
this way, the behaviour is exactly the same as in previous versions.) 

Experimentation is needed to see what number of parallel copies is best. 
The more copies, the faster the entire database will be processed. 
However, you don't want to occupy all the resources of the system (do
you?), and it may be less efficient than you might think to have a copy
for each URL, as starting that many processes takes time and resources;
and since some URLs may time out, you'll end up waiting for the timeouts
anyway. 

New with version 4.0, there is optional support for SysV IPC (shared
memory).  This avoids the need for temporary files, which were ugly and a
potential security problem.  It is brand new and there are likely some
bugs in it, especially in the code that sends the modification data from
the child processes to the parent.  To use it, search the body of urlmon
for the line that reads:

$sysv = 0;

and change it to

$sysv = 1;

You also need the IPC::SysV perl module, available on CPAN
(http://www.perl.com/CPAN/).
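If you'd rather not open an editor, a sed one-liner can flip the flag.
This sketch works on a one-line stand-in file, since the real urlmon
script isn't assumed to be on hand; on the real thing you would run the
same sed command against your copy of urlmon:

```shell
# Flip the $sysv flag with sed.  "urlmon.stub" is a stand-in file
# containing only the flag line; the sed pattern assumes the line
# appears exactly as "$sysv = 0;" at the start of a line.
printf '$sysv = 0;\n' > urlmon.stub
sed 's/^\$sysv = 0;/$sysv = 1;/' urlmon.stub > urlmon.flipped
cat urlmon.flipped       # prints "$sysv = 1;"
```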

With SysV IPC off, urlmon will behave just as it used to, as described in the
following paragraph.  

Each child monitoring process writes out a file with the results of the
monitoring it did.  The original parent process then collects the new
data, reports, deletes these temporary files, and writes out an updated
urlmonrc file.  The temp filename is defined in the variable
"$fork_file" (actually, there is one file for each copy; the files are
named by taking the string in "$fork_file" and tacking on a dot (".")
followed by the process id number of that copy).  If anyone wants this
to be settable on the command line, let me know. 

When perl gets threads I'll definitely take advantage of them here!

There is an experimental package that works with LWP to enable Parallel
HTTP connections.  If it also does FTP, I will rewrite urlmon to use it.
This will make urlmon simpler (of course, LWP will get correspondingly
more complicated).  This new module is called ParallelUserAgent, and it can 
be found in the LWP portion of CPAN.

j.  Authentication: Username/Passwd pairs

Although never mentioned, urlmon could always handle URLs of the form:

  ftp://username:passwd@host.domain.org/pub/dir/file

to allow you to check non-anonymous sites.  The LWP (libwww-perl) library
did all this.  However, the analogous form for http:

  http://username:passwd@host.domain.org/pub/dir/file.html

didn't work, until now.  Actually, there is partial support in LWP, in
that this URL will be correctly parsed, and an HTTP connection to
host.domain.org will be made, but the username and password won't be
used.  So, I hacked at it enough to ensure that the username and password
will be.  I wonder if LWP behaves this way because the HTTP protocol
prohibits (or at least doesn't define) this behaviour. 

(I believe LWP supports this correctly now, but I have not yet looked
into it.  If so, urlmon will become even simpler (from my point of view,
anyway).)

k.  Saving changed files

The '-o' option...

I'm still debating removing this option.  With it, all new/updated URLs
will be saved to the current directory.  The same thing could be done by
having the output of urlmon used as input to some automated web program
(like wget) to download.  The problem is that this requires two network
requests (and potentially two transfers of the same data) for each URL,
which isn't efficient. 

The reasons I'm not comfortable with it: 1) this is an archival function,
whereas urlmon is for monitoring; 2) I've got a habit of putting too many
features in, to the point where things get really bogged down and
confusing; and 3) it's not clear what urlmon should do with the
downloaded files, i.e. where they should go and what they should be
named.  And I'm too lazy to write the sanity-checking code (make sure the
directory they are to be written to exists, make sure the file names are
legal (or at least convenient) and unique, etc.). 

The new format of the urlmonrc file is easily extendible (by a perl
programmer, anyways), so specifying a directory for the file to be
written to shouldn't be too hard.  Hmm.  I kind of like that idea.  Maybe
I'll keep this functionality after all, just to show off the benefit of
the new database format.   Does someone want to write the save code?

4. Idiosyncrasies

All the things that I originally considered idiosyncrasies back in the 
original release of urlmon have been removed.  Anything else that could
be considered an idiosyncrasy has either been documented elsewhere or
is a bug :)

5. Miscellaneous

MAIL ME!  If you use urlmon, PLEASE PLEASE PLEASE PLEASE let me know. 
Send email to jdimpson@acm.org.  This is the first program I've released
to the world at large, and I'd be thrilled to know if anyone finds it
remotely useful.  Please send me questions, comments, and suggestions.


Oh, here's the sample cron script that I run twice a day, to alert me if
any URLs have changed.  After it is another script sent in by Peter Mardahl.
It formats the output into an HTML file so that you can browse new URLs 
right on the web.  Thanks Peter!


#!/bin/sh

/usr/local/bin/urlmon -cl -F 5 > /tmp/urlmon.$$ # generate a curt listing
if [ -s /tmp/urlmon.$$ ]      # check to see if we have any changed URLs
then                          # send mail if we do
  (echo "To: your@email.address";\
   echo "From: another@email.address";\
   echo "Subject: New or updated URLs"; echo "";\
   cat /tmp/urlmon.$$) | /usr/lib/sendmail -t
fi
rm /tmp/urlmon.$$




Date: Mon, 3 Aug 1998 10:51:55 -0700
From: Peter Mardahl 
To: jdimpson@acm.org
Subject: I like your program, urlmon, thanks for writing it.


It works just fine for me.

The following shellscript will put the links into a web page
instead of mailing it, a thing I find convenient:

(The web page is /accounts/peterm/public_html/changed_urls.html)

I run the shellscript in the crontab...  (Actually, I run a
different shellscript, one which won't splatter an existing
changed_urls.html:  I splatter that myself so that I don't 
miss changes if I don't look for n days.)

#!/bin/csh
/bin/rm -f /accounts/peterm/public_html/changed_urls.html
touch /accounts/peterm/public_html/changed_urls.html
/usr/local/bin/urlmon -cl -F 5 > /tmp/urlmon.$$
if ( ! -z /tmp/urlmon.$$ ) then
awk '{ printf("<A HREF=\"%s\">%s</A>\n<br>\n",$1,$1);}' /tmp/urlmon.$$ >! /accounts/peterm/public_html/changed_urls.html
endif
chmod a+r /accounts/peterm/public_html/changed_urls.html
/bin/rm -f /tmp/urlmon.$$
