README for gspider Gopherspace spider v 1.0
Tim Fraser -- tfraser@alum.wpi.edu
27 May 2004

gspider is a collection of perl scripts used to spider gopherspace.
The spider's output is a list of gopher servers showing which are up
and which are down.  For each server that is up, the spider records
how many total selectors that server has, and the number of other
sites to which that server links.  The spider does not download or
index site content - it just downloads directories and uses the
information it finds to reach other sites.

gspider's notion of "site" or "server" is a hostname:port string.  If
a single gopher server is named by more than one hostname or
hostname:port combination, gspider will count that gopher server
multiple times.  Oops.
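A short sketch of why this happens (Python for illustration only; the hostnames are invented): the spider keys its notion of a site on the literal hostname:port string, so two DNS names for one machine look like two different sites.

```python
def site_key(host, port=70):
    """Build the hostname:port string a site is keyed on (gopher default port 70)."""
    return f"{host}:{port}"

# Suppose gopher.example.org and example.org resolve to the same machine.
# The two keys differ, so the one server is counted twice.
seen = {site_key("gopher.example.org"), site_key("example.org")}
```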

gspider uses the lynx text-mode web browser to download data from
gopher sites.  You'll need lynx in order to make gspider work.


[ FILES ]

LICENSE      - the GPL.
README       - this file.
gspider.init - gspider will start spidering from these gopher servers.
gspider.pm   - gspider code used by other programs.
master       - the master program for gspider.
parsepage    - a test program to parse pages downloaded with lynx -source.
slave        - spiders a single site.  master uses one of these for each site.
testpage.dmp - a test page you can feed to parsepage to test the parsing code.


[ TO SPIDER GOPHERSPACE ]

To spider all of gopherspace, do this:

> cp gspider.init gspider.state
> master -n 16

This will start spidering gopherspace from scratch using 16 parallel
slave processes.  You can choose a larger or smaller number depending
on the processing resources you have available.
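The fan-out works roughly like this sketch (a simplified Python model, not the actual Perl master; a thread pool stands in for the slave processes, and the link table is made up):

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for real spidering: which sites each site links to.
LINKS = {"a:70": {"b:70"}, "b:70": {"a:70", "c:70"}, "c:70": set()}

def spider_site(site):
    """Stub for the slave: spider one site, return the sites it links to."""
    return LINKS.get(site, set())

def run_master(initial_sites, n_slaves=16):
    """Spider every reachable site, running up to n_slaves slaves at once."""
    pending = set(initial_sites)
    visited = set()
    with ThreadPoolExecutor(max_workers=n_slaves) as pool:
        while pending:
            batch = list(pending)
            pending.clear()
            visited.update(batch)
            # One slave per site in the batch, n_slaves at a time.
            for links in pool.map(spider_site, batch):
                pending.update(links - visited)
    return visited
```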

Expect to see the spider create the following new files:
gspider.log - error messages show up here.
gspider.out - the results of the spidering show up here.

When the spider is done, check the gspider.out file for the results.

If you want to stop the spider before it has spidered all of
gopherspace, send the master process a SIGHUP.  This signal will not
halt the spider.  However, when the next slave process terminates, the
master process will dump the master program's state into
gspider.state.  Once the dump is complete, you can SIGTERM all the
spider processes to kill them.  Later, you can run the master program
again, and it will use the contents of gspider.state as its initial
state, so it will begin again close to where it left off.  The spider
keeps running after a SIGHUP dump - you can send it a SIGHUP every
so often if you're worried about a crash happening before the
spidering completes.

Note that the SIGHUP saves the master program's state but not the
slave programs' states.  So, you'll lose the progress made by slaves
who have spidered only part of a site.  When you restart the master,
it will create slaves for those sites again, and the slaves will
repeat the lost work.
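The checkpoint dance can be modeled in a few lines (a Python sketch of the idea, not the Perl implementation; the state-file format and method names here are invented):

```python
import signal

class Master:
    """Toy model of the master's SIGHUP checkpointing."""

    def __init__(self, pending, state_path="gspider.state"):
        self.pending = list(pending)   # sites not yet fully spidered
        self.state_path = state_path
        self.dump_requested = False
        # The SIGHUP handler only sets a flag; the dump itself waits
        # for a safe point (the next slave completion).
        signal.signal(signal.SIGHUP, lambda *_: self.request_dump())

    def request_dump(self):
        self.dump_requested = True

    def on_slave_done(self, site):
        """Called as each slave finishes - a safe point to dump state."""
        if site in self.pending:
            self.pending.remove(site)
        if self.dump_requested:
            with open(self.state_path, "w") as f:
                f.write("\n".join(self.pending) + "\n")
            self.dump_requested = False
```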


[ TO SPIDER A SINGLE SITE ]

You can spider a single site using the slave program directly.  Do this:

> slave -h myhost.org -p 70 -r myhost.remote -s myhost.stats -e myhost.log

The slave will traverse all of the directories for myhost.org:70.
When it completes its traversal, it will write a list of the other
gopher servers linked by myhost.org:70 to myhost.remote.  It will also
write statistical information (number of selectors, links) to
myhost.stats.  Errors will show up in myhost.log.

The slave program will not jump from site to site.  It spiders only a
single site.
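For reference, the data the slave walks is the RFC 1436 gopher menu
format: each line is an item-type character followed by a tab-separated
display string, selector, host, and port, with "." ending the menu.
A rough parser in Python (the project's real parsing lives in
gspider.pm/parsepage and works on lynx output; this sketch handles raw
menus only):

```python
def parse_menu(text, this_site):
    """Count selectors in a gopher menu and collect remote host:port links."""
    selectors = 0
    remotes = set()
    for line in text.splitlines():
        if line == "." or not line:       # "." terminates a menu
            continue
        fields = line[1:].split("\t")     # line[0] is the item-type char
        if len(fields) < 4:
            continue                      # not a well-formed menu line
        _display, _selector, host, port = fields[:4]
        selectors += 1
        site = f"{host}:{port}"
        if site != this_site:             # a link to some other server
            remotes.add(site)
    return selectors, remotes
```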


[ NOTES ON CICADA INCOMPLETE GOPHER CENSUS ]

I wrote this software to complete what I eventually called the "Cicada
Incomplete Gopher Census".  The software wasn't perfect - I had to
look at the root pages of some sites and determine whether or not they
were valid gopher sites by hand.

There were over 100 sites whose root pages were more than 0 bytes
long but contained no parse-able selectors.  Manual
examination revealed that this was due to lynx, which likes to
HTML-ify pages.  Most of these 0-selector pages were HTTP error
messages dressed up in HTML.  Over a dozen were gopher error messages
dressed up in HTML.  Half a dozen were properly-running gopher servers
that were acting as "finger" or "who" gateways.  I determined which
was which by downloading the pages manually with lynx -source and
inspecting them.  I classified most of them in groups based on the
size of the page rather than looking at each one.  For example, I
decided 106-byte pages were all a particular HTTP error message based
on an examination of a few examples, and proceeded quickly from there.




