
With version 4.0....

Table of Contents
-----------------
1. the urlmonrc file
2. FILTER=
3. FARGS=
4. using a predefined filter
5. defining your own content-based filter (CODE= tag)
6. defining your own non-content-based filter



1. the urlmonrc file

See URLMONRC.txt for more in-depth info of the new urlmonrc format.

Of interest here are the two options FILTER and FARGS.  FILTER sets which
filter will be used for that URL.  FARGS sets what arguments will be given
to the filter.

I don't advise using spaces in any regular expression arguments to FARGS.

2. FILTER=

For any URL, adding a line that says "FILTER=foo" will cause urlmon to
apply the filter named foo to this URL.

Usually, after a filter is applied the results are checksummed.  So using
a filter forces checksumming, something to keep in mind if your network is
slow, as it forces the entire URL data to be downloaded (using a GET), not
just the meta-data (using a HEAD).  (There are exceptions.  See section
6.)

3. FARGS=

Some filters require arguments, which are specified using the "FARGS=foo"
syntax.  These arguments will be fed to the filter function.

The format for having more than one argument is  "FARGS=foo1,foo2,foo3".

All of the predefined filters (listed below) will act like
standard "checksum" if there are no arguments, or the arguments are
incorrect.

Note that besides what is defined in the FARGS statement, all filters
(except timestamp) are fed as a first argument the content of the URL.
See section 5 for more information.

Note that at this time you cannot have spaces in your arguments.  This
will change as soon as I figure out how to syntactically handle it.

4. using a predefined filter

The following filters have been predefined:

timestamp
checksum
random
null
lines
regexp
pararegexp

Note that with the exception of timestamp (which doesn't rely on
checksumming) there is nothing special about any of these.  I just assume
they will be the most requested ones, so I've implemented them myself.

If you come up with a _very_ general purpose one that will be of use to
many people, let me know and I'll consider adding it as a predefined
filter.  Also, if you come up with a better (more efficient, more
flexible, or just more useful) version of any of these, also let me know.

"timestamp" and "random" are actually special cases, and are considered a
"non-content-based" filters.  In fact, they really aren't filters, but
who's keeping track.  See section 6 for information about
non-content-based filters. 

Using "random" as a filter method will cause urlmon to randomly report
that the URL has changed.  It is included as an example (along with
timestamp) of non-content-based filters, and really isn't useful.

"checksum" is as simply as possible.  It does absolutely nothing but
return the content of the URL.  The framework of the filter system will
take the checksum.  Think of this as the null filter.  It takes no
arguments from FARGS

When you add a new URL to your database, urlmon will choose either
checksum or timestamp.  It will use timestamp unless there is no
last_modified data provided by the web server, or if the "-s" option is
set, forcing checksums.  You don't even need to enter a FILTER= line in the
urlmonrc database.  urlmon will do it for you.

"null" always returns an empty string.  After running this filter on a URL
once, it will never change.  This is good for disabling the monitoring of
a URL without taking it out of the urlmonrc database. 

"line" takes an FARGS that looks like this: "FARGS=x,y".  When this filter
gets applied, it will blank out all the lines between line number x and
line number y.  If you know there is data you don't care about that always
lies at certain line numbers, this is the filter for you, as it is 1) a
bit more efficient than some others, and 2) much easier to use.  All you
have to do is count lines.  BE SURE to count the first line as 0!!
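For example (hypothetical URL; MODS value elided as in the other examples
in this document), blanking out lines 4 through 6 would look like this:

URL=http://host.domain.org/	MODS=XXXXX	FILTER=lines	FARGS=4,6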

The next two can be tricky to use, especially if you don't know how to
program in Perl.  I have enclosed some samples.

"regexp" and "pararegexps" each take an argument that needs to be valid
perl code. There is one exception: The code cannot start or end with a 
double quote (").  Quotes around the values to FARGS (or any other) tags
are stripped off before being used.  Also, you cannot have commas in it,
because commas delimit the various arguments.  This is a bad thing, and 
will be fixed next release.

"regexp" will apply its argument(s) to a variable containing the content of
the URL as a binding (=~).  Lets say that you want to remove all lines
containing the word "advertisement".  You would have the following line in
the urlmonrc file:

URL=http://host.domain.org/	MODS=XXXXX	FILTER=regexp	 FARGS=s/^.*advertisement.*$//g

So the text of http://host.domain.org/ will be placed in an array
variable called "@content" (one line per element) and the following code
will be executed:

	foreach $_ (@content) {
		$_ =~ s/^.*advertisement.*$//g;
	}

Then the updated contents of "@content" will be returned to be
checksummed.

Let's say you want to force all data to be in lower case.  You would use
this in the urlmonrc file:

URL=http://host.domain.org/     MODS=XXXXX      FILTER=regexp	FARGS=tr/A-Z/a-z/

You can also have multiple arguments for regexp, like this:

URL=http://host.domain.org/     MODS=XXXXX      FILTER=regexp FARGS=s/^.*advertisement1.*//g,s/^.*advertisement2.*//g,s/^.*advertisement3.*//g,

This is equivalent to the following perl code:

        foreach $_ (@content) {
                $_ =~ s/^.*advertisement1.*$//g;
                $_ =~ s/^.*advertisement2.*$//g;
                $_ =~ s/^.*advertisement3.*$//g;
        }

You can do this with 'pararegexp', too.


The limitation of regexp is that the pattern matching applies to only
one line of the content at a time.  This makes the regular expression
easier to write, but it is of limited use, since it is legal for HTML
constructs to cross multiple lines.

To solve this, "pararegexp" (which means paragraph regexp, but shouldn't
because it has nothing to do with paragraphs) will apply its argument to
the entire contents of the URL.  Remember to use the /g modifier (like
one of the above examples) to match more than one instance of whatever
you are going to match.

Working with multi-line patterns and regexps can be tricky.  
"." won't match a newline, so if you want to match, say,
everything up to the string "END HEADER", use the following:

FARGS=s/^[\000-\377]*END\sHEADER//

Or, you can use the /s modifier, which makes '.' match '\n':

FARGS=s/^.*END\sHEADER//s

Note that the ^ and $ match the beginning and ending of the entire string
unless you use the /m modifier.  This will cause ^ and $ to match
internal '\n', so will treat the string as multiple lines.  See some good
docs on Perl regexps for more info.
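For example, with pararegexp the /m modifier lets you blank every line that
starts with "Date:" in a single pass (hypothetical URL):

URL=http://host.domain.org/	MODS=XXXXX	FILTER=pararegexp	FARGS=s/^Date:.*$//mg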


5. defining your own content-based filter

YOU MUST KNOW (some) PERL TO DO THIS!

As of version 3.0, urlmon has full capabilities to allow user-defined
filters.  As of version 4.0, I have a nifty way to define filters on a
per-user basis.

In the urlmonrc file, you can have a CODE=file (or CODE=/path/to/file) tag
that tells urlmon to read in the given file and execute it as perl code.

The perl code in this file will get executed in the 'urlmonfilter' namespace
so as not to pollute the 'main' namespace.  (A namespace, in perl, is a
mechanism for scoping variable names.)  The intended usage is to allow the
user to modify the variable %filtercode.  The user can add new filters,
delete filters, or modify filters.

Note that the location of a CODE tag in the urlmonrc file specifies only
the order in which the code files will be read (first come, first
'require'd); they are all read in before any URLs are processed.

It used to be that you had to actually edit the urlmon file itself and
change a variable called '%filtercode'.  This was bad because when you
upgraded, you had to cut and paste your custom filters from the old
version into the new one.

urlmon still has some definitions in the %filtercode hash, but you can 
override (or delete) them if necessary.  If you want to add to the filter 
code file, you'd do something like this (where 'foo' is the filter to be 
added):

$filtercode{'foo'} = sub {
	# new filter code
};

To delete, something like this:

delete $filtercode{'foo'};

If you want to totally replace the default filter code, you'd just use
something like the examples below.

In any case, the last line of your file should be '1;' to keep perl
happy.

The custom filter code is kept in its own perl package namespace (called
urlmonfilter).  This is done so that the variable names used in the
customizable filter won't conflict with the main urlmon code.  The filter
code itself is kept in the hash array called '%filtercode'.  The key to the
hash array is whatever string you want to use on the "FILTER=" line of the
urlmonrc file, and the value is a subroutine that will apply your filter
algorithm to the content of the URL.

As an example, look at the code for checksum:


%filtercode = (
        'checksum'      =>      sub { shift },
	# other filters snipped
);


Here is a more explicit version of checksum:

%filtercode = (
        'checksum'      =>      sub {
		my($content) = @_;
		return $content;
	},

        # other filters snipped
);

All filters, when executed, will be passed as a first argument the content
of the URL they are to filter.  They must return some altered form of the
$content.  (Actually, they could simply make something up.  That's the
beauty of this modular system.  The code for "null" is this:

	'null'		=>	sub { ''; },

This just returns an empty string, regardless of the content.)

You can also accept other arguments through the FARGS statement.  Each
element of the FARGS statement, separated by commas, will wind up in the @_
array for your subroutine.  Remember that $_[0] will be the content of the
URL; $_[1] through $_[n] will be the values specified on the FARGS
statement.  Remember also to do some kind of sanity checking on the args.
It is a good idea, in the event that the args were forgotten or were
invalid, to simply print a warning to standard error and then return the
untouched URL content.  This way your filter will act as though it were
the standard "checksum" method.
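A sketch of that fallback pattern (the filter name "skipword" and its
behavior are invented for this example: it blanks every line containing a
given word, and acts like "checksum" when its argument is missing):

```perl
our %filtercode;

$filtercode{'skipword'} = sub {
	my ($content, $word) = @_;

	# Sanity check: with no FARGS value, warn and act like "checksum".
	unless (defined $word and length $word) {
		warn "skipword: missing argument, acting like checksum\n";
		return $content;
	}

	my @content = split /\n/, $content, -1;
	foreach $_ (@content) {
		$_ = '' if index($_, $word) >= 0;	# blank matching lines
	}
	return join "\n", @content;
};
```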

There are two reasons to define your own content-based filter.  The first
is convenience.  Say you monitor many different pages on one web site,
and on all of these pages lines 4 through 6 have data that you want to
filter out before making a checksum.  Instead of setting FILTER to
"lines" and FARGS to "4,6" for every URL that is on this site, you could
define a filter called "site4to6" (where "site" could be a descriptive
name for the site it applies to), and it would look like this:

	"site4to6"	=>	sub {
		my ($content) = @_;
		&{$filtercode{'lines'}}($content, 4, 6);
	},

Basically, this is a wrapper around the "lines" filter, with the arguments
4 and 6 hardcoded.

The other reason for defining your own is if you have a URL (or URLs) with
special characteristics, and none of the predefined filters work well.
Let's say we have a site where we are only concerned with changes to one
particular line of the content, the line that says "Last Modified".

We could do this:

	"lastmod"	=>	sub {
		my ($content) = @_; # no other args are necessary.  If
				    # anything is present in FARGS, it
				    # (they) will be ignored.

		my (@content) = split /\s*[\r\n]+\s*/, $content;

		foreach $line (@content) {
			if ($line =~ /Last Modified:\s*(.*)/i) {
				return $1; # return the modification time
					   # as the data to be checksummed
			}
		}
		return '';
	}, # END of "lastmod"

If you had, say, two arguments for a filter (so you would write
"FARGS=one,two" in the urlmonrc file), you would use them in your code
like this:

	"foo"		=>	sub {
		my($content, $one, $two) = @_;

		# ...

		return $content;
	}, # END of "foo"


(You might want to avoid using "regexp" and "pararegexp" because they both
use "eval" statements, which involve a bit of overhead.  It is necessary
in order to make them general purpose, but instead you might want to copy one
of them, give it a new name, then hardcode in your substitution or
translation code, thus avoiding the "eval".) 
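For example, a hardcoded stand-in (the name "noads" is invented here) for
the earlier FILTER=regexp FARGS=s/^.*advertisement.*$//g example would
skip the eval entirely:

```perl
our %filtercode;

# Same effect as the regexp filter with FARGS=s/^.*advertisement.*$//g,
# but with the substitution compiled in, so no eval is needed.
$filtercode{'noads'} = sub {
	my ($content) = @_;
	my @content = split /\n/, $content, -1;
	foreach $_ (@content) {
		$_ =~ s/^.*advertisement.*$//g;
	}
	return join "\n", @content;
};
```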


6. defining your own non-content-based filter

Non-content-based filters really aren't filters.  They are decision
makers.  They decide when a URL has changed.  The logic for all
the filters (non-content and content) is in the subroutine
'dofilter' of urlmon.  "timestamp" is one, "random" is another.  These are
both explicitly defined.  If dofilter can't find the filter explicitly
defined, it will look up the filter in the hash %urlmonfilter::filtercode
and execute it.

Defining your own non-content-based filter involves adding your own code
directly to urlmon. It isn't hard.  Find the line that says: 

	if (!$opt_o and $filter eq 'timestamp') {
	
		...

	} elsif (...) {

This is the spot.  You can add checks for other filters in subsequent
'elsif' calls (see the line for 'random' for an example).  In the body of
these elsif statements, you must return either a 1 or a 0, 1 for "yes, the
URL has changed", 0 for "no, it hasn't changed".

You are allowed to store state information in the variable "$mods{$url}".
This will be written out to the urlmonrc file (NO SPACES, at least not
now).  You will have access to this info on subsequent checks.  Note that
I haven't tested any data types other than timestamps and checksums,
so you might get weird warnings or weird behaviour if I made any assumptions.

Possible reasons for writing one of these would be if you want to be
alerted when, for example, some part of the headers from the web server
has changed.  You can get access to the HTTP connection through a
variable called "$response".  This variable is an HTTP::Response object,
and you can learn about its data and methods from the LWP documentation.

You could monitor server changes by looking at the Server: line in the
response from the web server.  Were it ever to change, you could return 1,
indicating that the URL has changed.  Then, of course, you'd have to
remember that when this URL "changes", it really means the web server
software has changed.  This is good information to place in the comment
(COMM=) section or display (DISPLAY=) section.
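A sketch of such an elsif branch (the filter name "serverwatch" is made up
for this example; the variables $filter, $mods{$url}, and $response are as
described above, but the exact surrounding code in your copy of urlmon may
differ slightly):

	} elsif ($filter eq 'serverwatch') {
		# report a change whenever the Server: header differs
		# from the value we stored on the previous check
		my $server = $response->header('Server') || '';
		if ($mods{$url} ne $server) {
			$mods{$url} = $server;	# remember for next check
			return 1;		# yes, the URL has changed
		}
		return 0;			# no, it hasn't changed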

