          WWWOFFLE - World Wide Web Offline Explorer - Version 1.2e
          =========================================================


The WWWOFFLE programs simplify World Wide Web browsing from computers that use
intermittent (dial-up) connections to the internet.

Description
-----------

The wwwoffled program is a simple proxy server with special features for use
with dial-up internet links.  This means that it is possible to browse web pages
and read them without having to remain connected.

While Online
    - Caching of pages that are viewed for review later.
    - Conditional fetching to only get pages that have changed.

While Offline
    - The ability to follow links and mark other pages for download.
    - Browser or command line interface to select pages for downloading.
    - Optional info on bottom of pages showing cached date and allowing refresh.
    - Works with pages containing forms.

Automated Download
    - Downloading of specified pages non-interactively.
    - Can automatically fetch inlined images in pages fetched this way.
    - Can automatically fetch contents of all frames on pages fetched this way.
    - Automatically follows links for pages that have been moved.

Provides
    - An introductory page with information and links to the built-in pages.
    - Multiple indices of pages stored in cache for easy selection.
    - Interactive or command line control of online/offline status.
    - User selectable purging of pages from cache based on hostname.
    - Interactive or command line option to fetch pages and links recursively.

General
    - Can be used with one or more external proxies based on hostname.
    - Configurable to still allow use on intranets while offline.
    - Can be configured to block or not cache URLs based on file type or host.
    - Can censor outgoing HTTP headers to maintain user privacy.
    - All options controlled using a simple configuration file.
    - Optional password control for management functions.


Configuring A Web Browser
-------------------------

To use the wwwoffle programs, your web browser must be set up to use wwwoffled
as a proxy.  The proxy hostname will be 'localhost' (or the name of the host
that wwwoffled is running on), and the port number will be the one that is used
by wwwoffled (default 8080).

Netscape V1: In the Options->Preferences dialog window, enter localhost as the
             proxy and 8080 as the port number.
Mosaic V2.6\
Lynx, Arena: Set the environment variable http_proxy to http://localhost:8080/
Emacs-W3   /
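
For the browsers that read the environment, the variable can be set in the shell
before the browser is started.  A minimal sketch, assuming a Bourne-style shell
and that wwwoffled is running on the same machine with the default port:

```shell
# Assumes wwwoffled is on this machine and http-port is left at 8080.
http_proxy="http://localhost:8080/"
export http_proxy
```

Browsers started from this shell afterwards inherit the setting.  Change the
host or port if wwwoffled runs elsewhere or uses a different port.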

You will also need to disable the caching that the web browser performs itself
between sessions to get the best out of the program.

Depending on which browser you use and which version, it is possible to request
pages to be refreshed while offline.  This is done using the 'reload' or
'refresh' button or key on the browser.  Many browsers have two ways of doing
this; only the one that forces the proxy to reload the page will cause the page
to be refreshed.


Welcome Page
------------

There is a welcome page at URL 'http://localhost:8080/' that gives a very brief
description of the program and has links to the index pages, interactive control
page and the wwwoffle internet home pages.

The most important places to get information about wwwoffle are the wwwoffle
homepage 'http://www.gedanken.demon.co.uk/wwwoffle/index.html', which has
information about wwwoffle in general, and the wwwoffle version-1.2 user page
'http://www.gedanken.demon.co.uk/wwwoffle/Version-1.2/user.html', which has
more information about this version of wwwoffle.


Index Of Cached Files
---------------------

To get the index of cached files, use the URL 'http://localhost:8080/index/'.
The index can be sorted by time of caching, time of last access or
alphabetically.  It lists all of the hosts; selecting one of these gives an
index for that host.  In the host index there is a link to each page and a
refresh button.  The link gets the cached version; the refresh button requests
a new copy if offline, or gets it if online.  From the main index there is also
a flat index showing all the pages in the cache that have been modified in the
last week (or a user specifiable time).  This is sorted by modification time
and split into one day intervals.


Interactive Refresh Page
------------------------

Pages can be requested using whatever method the browser provides or, as an
alternative, using the interactive refresh page.  This allows
the user to enter a URL and then fetch it if it is not currently cached or
refresh it if it is in the cache.  There is also the option here to recursively
fetch the pages that are linked to by the page that is specified.  This
recursive fetching can be limited to pages from the same host, narrowed down to
links in the same directory (or subdirectory) or widened to fetch pages from any
web server.  This functionality is also provided in the 'wwwoffle' command line
program.


Interactive Control Page
------------------------

The behaviour and mode of operation of the wwwoffle demon can be controlled from
an interactive control page at 'http://localhost:8080/control/'.  This has a
number of buttons that change the mode of the proxy server.  These provide the
same functionality as the 'wwwoffle' command line program.  To provide security,
this page can be password protected.  There is also the facility to delete pages
from the cache or from the spooled outgoing requests directory.


The Programs and Configuration File
-----------------------------------

There are two programs that make up this utility, with three distinct functions.

wwwoffle  - A program to interact with and control the HTTP proxy demon.

wwwoffled - A demon process that acts as an HTTP proxy.
wwwoffles - A server that actually does the fetching of the web pages.

The wwwoffles function is combined with the wwwoffled function into the
wwwoffled program from version 1.1 onwards.  This is to simplify the procedure
of starting servers, and allow for future improvements.

The configuration file, called wwwoffle.conf by default, contains all of the
parameters that are used to control the way the wwwoffled and wwwoffles
functions work.


WWWOFFLE - User control program
-------------------------------

The control program (wwwoffle) is used to control the action of the demon
program (wwwoffled), or to request pages that are not in the cache.

The demon program needs to know if the system is online or offline, when to
fetch the pages that have been previously requested and when to purge the cache
of old pages.


The first mode of operation is for controlling the demon process.  These are the
functions that are also available on the interactive control page.

wwwoffle -online        Indicates to the demon that the system is online.

wwwoffle -offline       Indicates to the demon that the system is offline.

wwwoffle -fetch         Commands the demon to fetch the pages that were
                        requested by browsers while the system was offline.
                        wwwoffle exits when the fetching is complete.
                        (This requires the demon to be told it is online).

wwwoffle -config        Cause the configuration file for the demon process to be
                        re-read.  The config file can also be re-read by sending
                        a HUP signal to the wwwoffled process.

wwwoffle -purge         Commands the demon to purge from the cache the pages
                        that are older than the number of days specified in the
                        configuration file, using modification or access
                        time.  If a maximum size is specified, the oldest pages
                        are deleted until the maximum size is not exceeded.

wwwoffle -c <config-file>
                        Can be used with the above options to specify the
                        configuration file that contains the password and server
                        hostname.  If no password is used then no -c option is
                        needed, and the -p option can be used as needed.

wwwoffle -p <host[:port]>
                        Can be used with the above options to specify the
                        hostname and port number that the demon program listens
                        to for control messages.
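
A typical dial-up session strings the first three of these options together.
The following is only a sketch: the function names link_up and link_down are
made up for illustration, and where they are called from (for example the
scripts that start and stop the dial-up link) depends on the local setup.

```shell
# Hypothetical helpers; only the wwwoffle options come from the list above.

link_up() {
    # Tell the demon the link is up, then fetch everything that was
    # requested while offline.  wwwoffle -fetch exits when fetching is done.
    wwwoffle -online && wwwoffle -fetch
}

link_down() {
    # Tell the demon that the link has gone away again.
    wwwoffle -offline
}
```

If a password is set in the configuration file, the -c option described above
would also be needed on each command.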


The second mode of operation is to specify URLs to get.

wwwoffle <URL> .. <URL> Specifies to the demon URLs that must be fetched.
                        If online then they are fetched immediately, else the
                        requests are stored for a later fetch.

wwwoffle -F             Force the wwwoffle server to refresh the URL.
                        (Or fetch it if not cached.)

wwwoffle -i             Specifies that the URLs when fetched are to be parsed
                        for images and these are also to be fetched.

wwwoffle -f             Specifies that the URLs when fetched are to be parsed
                        for frames and these are also to be fetched.

wwwoffle -r[<depth>]    Specifies that the URL when fetched is to have the links
                        followed and these pages also fetched (to a depth
                        specified by the optional depth parameter, default 1).
                        Only links on the same server are to be fetched.

wwwoffle -R[<depth>]    This is the same as the '-r' option except that all of
                        the links are to be followed, even those to other
                        servers.

wwwoffle -d[<depth>]    This is the same as the '-r' option except that links
                        are only followed if they are in the same directory or a
                        sub-directory.

wwwoffle -p <host[:port]>
                        Can be used to specify the hostname and port number that
                        the demon program listens to for HTTP proxy connections.
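
Assuming that these options can be combined on one command line, a request for
a page together with its images, frames and nearby links might be wrapped up as
below.  The function name queue_site is made up, and writing the options before
the URLs with the depth attached directly to -r (as the '-r[<depth>]' notation
suggests) is an assumption rather than something stated above.

```shell
# Hypothetical helper: request each URL given, together with its inlined
# images (-i), its frames (-f) and the pages that it links to on the same
# server, two levels deep (-r2).
queue_site() {
    wwwoffle -i -f -r2 "$@"
}
```

Called as 'queue_site http://www.gedanken.demon.co.uk/wwwoffle/index.html', the
pages are fetched immediately if the demon is online and are stored as requests
otherwise.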


The third mode of operation is to get a URL from the cache.

wwwoffle <URL>          Specifies the URL to get.

wwwoffle -o             Get the URL and output it on the standard output.
                        (Or request it if not already cached.)

wwwoffle -p <host[:port]>
                        Can be used to specify the hostname and port number that
                        the demon program listens to for HTTP proxy connections.
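
Since the page is written to standard output it can be redirected to a file.
A minimal sketch: the function name save_cached and both of its arguments are
made up for illustration, and exactly what is written (for instance whether a
reply header is included) is not specified here.

```shell
# Hypothetical helper: $1 is the URL to get, $2 is the file to write.
save_cached() {
    wwwoffle -o "$1" > "$2"
}
```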


The last mode of operation is to provide help in using the other modes.

wwwoffle -h             Gives help about the command line options.


WWWOFFLED - Demon program
-------------------------

The demon program (wwwoffled) runs as an HTTP proxy and also accepts connections
from the control program (wwwoffle).

The demon program needs to maintain the current state of the system, online or
offline, as well as the other parameters in the configuration file.

As HTTP proxy requests come in, the program forks a copy of itself (the
wwwoffles function) to handle the requests.  The server program can also be
forked in response to the wwwoffle program requesting pages to be fetched.


wwwoffled -c <config-file>      Starts the demon with the named configuration
                                file.

wwwoffled -d                    Starts the demon in debugging mode, i.e. it
                                does not detach from the terminal and uses
                                standard error for the log messages.

wwwoffled -h                    Gives help about the command line options.


There are a number of error and informational messages that are generated by the
program as it runs.  By default (in the config file) these go to syslog.  When
the -d flag is used the demon does not detach from the terminal and the messages
are also sent to standard error.

To have wwwoffled started automatically when the computer is booted, add the
following to the file /etc/rc.d/rc.local:

# The WWWOFFLE HTTP proxy server.
if [ -x /usr/local/sbin/wwwoffled ]; then
   /usr/local/sbin/wwwoffled -c /var/spool/wwwoffle/wwwoffle.conf
fi

By using the run-uid and run-gid options in the config file, it is possible to
change the user id and group id that the program runs as.  This will require
that the program is started by root and that the specified user has read/write
access to the spool directory.


WWWOFFLES - Server program
--------------------------

The server (wwwoffles) starts by being forked from the demon (wwwoffled) in one
of three different modes.

Real  - When the system is online and acting as a proxy for a browser.
        All requests for web pages are handled by forking a new server which
        will connect to the remote host and fetch the page.  This page is then
        stored in the cache as well as being returned to the browser.  If the
        page is already in the cache then the remote server is asked for a newer
        page; if there is none, the cached one is used.

Fetch - When the system is online and fetching pages that have been requested.
        All web page requests in the outgoing directory are fetched by the
        server connecting to the remote host to get the page.  This page is then
        stored in the cache; there is no browser active.  If the page has been
        moved then the link is followed and that one fetched.

Spool - When the system is offline and acting as a proxy for a browser.
        All requests for web pages are handled by forking a server that will
        either return a cached page or store the request.  If the page is
        cached, it is returned to the browser, else a dummy page is returned
        (and stored in the cache), and the outgoing request is stored.
        If the cached page refers to a page that failed to be downloaded then it
        will be deleted from the cache.

Depending on the existence of files in the spool and other conditions, the mode
can be changed to one of several other modes.

RealNoCache - For requests for pages on the server machine or those specified
        not to be cached in the configuration file.

RealRefresh - Used by the refresh button on the index or the wwwoffle program
        to refetch a page while the system is online.

SpoolGet - Used when the page does not exist in the cache so a request needs to
        be stored for it in the outgoing directory.

SpoolRefresh - Used when the refresh button on the index or the wwwoffle program
        is used.  The existing spooled page (if there is one) is not
        overwritten, but a request is stored.

SpoolPragma - Used when the browser requests the cache to refresh the page
        using the 'Pragma: no-cache' header.  The existing spooled page (if
        there is one) is not overwritten, but a request is stored.


WWWOFFLE.CONF - Configuration file
----------------------------------

The configuration file (wwwoffle.conf) specifies all of the parameters that
control the operation of the proxy server.  The file is split into sections each
containing a series of parameters as described below.  The sections are
delimited in the file by having the section name alone on a line, a line
containing a single '{', the parameters in the section and a line containing a
single '}'.  Comments are marked by a '#' at the start of the line.
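
As an illustration of this layout, the sketch below writes a small example
file.  The name wwwoffle.conf.example is made up so that a live configuration
is not overwritten, the values are simply the documented defaults from the
StartUp section, and the indentation of the parameters is assumed to be
ignored by the program.

```shell
# Write an example file into the current directory (hypothetical file name).
cat > wwwoffle.conf.example <<'EOF'
# A comment: the '#' is at the start of the line.
StartUp
{
 http-port     = 8080
 wwwoffle-port = 8081
 spool-dir     = /var/spool/wwwoffle
}
EOF
```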


StartUp - This contains the parameters that are used when the program starts;
          changes to these are ignored if the configuration file is re-read
          while the program is running.

  http-port         = <port>          ; An integer specifying the port for the
                                        HTTP proxy (default=8080).
  wwwoffle-port     = <port>          ; An integer specifying the port for
                                        wwwoffle control connections
                                        (default=8081).
  spool-dir         = <dir>           ; The name of the spool directory
                                        (default=/var/spool/wwwoffle).
  run-uid           = <user> | <uid>  ; The username or numeric uid to run the
                                        wwwoffled server as (default=none).
  run-gid           = <group> | <gid> ; The groupname or numeric gid to run the
                                        wwwoffled server as (default=none).
  use-syslog        = yes | no        ; Whether to use the syslog facility for
                                        messages (default=yes).
  password          = <word>          ; The password used for authentication of
                                        the control message (default=none).
  max-servers       = <integer>       ; The maximum number of server processes
                                        that are started (default=8).
  max-fetch-servers = <integer>       ; The maximum number of server processes
                                        that are started to fetch pages that
                                        were marked in offline mode (default=4).

  Notes: For the password to work the configuration file must be set so that
         only authorised users can read it.
       : To use the run-uid/run-gid options, the server must be started as root.
       : The max-fetch-servers value must be less than max-servers or you will
         not be able to use wwwoffle interactively online while fetching.
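
The last of these notes can be checked before the demon is started.  The
function below is only a sketch and is not part of wwwoffle: the name
check_servers is made up, it assumes one parameter per line written as shown
above, and it falls back on the documented defaults (8 and 4) for a parameter
that is not present in the file.

```shell
# Hypothetical check: succeeds only if max-fetch-servers < max-servers.
check_servers() {   # $1 = configuration file to check
    awk -F= '
        BEGIN { max = 8; fetch = 4 }    # documented defaults
        /^[ \t]*max-servers[ \t]*=/       { max   = $2 + 0 }
        /^[ \t]*max-fetch-servers[ \t]*=/ { fetch = $2 + 0 }
        END { exit !(fetch < max) }
    ' "$1"
}
```

For example 'check_servers /var/spool/wwwoffle/wwwoffle.conf || echo "too many
fetch servers"' prints the warning only when the values would leave no server
free for interactive use.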

Options - Options that control how the program works.

  loglevel          = debug | info | important | warning | fatal
                               ; Log messages with this or higher priority
                                 (default=important).
  fetch-images      = yes | no ; Whether to fetch the images that are contained
                                 in pages that are requested while offline and
                                 downloaded later (default=no).
  fetch-frames      = yes | no ; Whether to fetch the frames that are contained
                                 in pages that are requested while offline and
                                 downloaded later (default=no).
  index-latest-days = <age>    ; The number of days to display in the index of
                                 the latest pages (default=7 days).
  add-info-refresh  = yes | no ; Whether to add, at the bottom of all of the
                                 spooled pages, the date that the page was
                                 cached and a refresh button (default=no).
  request-changed   = yes | no ; Whether, while online, to request changes to a
                                 page that is already in the cache
                                 (default=yes).


LocalHost - A list of hosts that the host running the wwwoffled server may
            be known by.  This is so that the proxy does not need to contact
            itself to get the server local pages.

  <host> ; A hostname or IP address that in connection with the port number (in
           the StartUp section) specifies the wwwoffle proxy HTTP server.

  Notes: All of these hosts are also used the same way as those in the
         LocalNet and AllowedConnect sections.
       : The first named host is used as the server name for several features.
       : None of the entries here or in LocalNet are fetched via a proxy.


LocalNet - A list of hosts that are not to be cached by wwwoffled because they
           are on a local network.

  <host> ; A hostname or IP address that is not to be cached by the server.

  Notes: The host name matches from the right so a domain name matches all hosts
         in the domain, IP addresses match from the left.
       : All entries here are assumed to be reachable even when offline.
       : All of the hosts in LocalHost are also not cached.
       : None of the entries here or in LocalHost are fetched via a proxy.
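
The matching rule in the first note can be pictured with a small shell
function.  This is an illustration of the rule only and not the code that the
program uses; the name host_matches is made up, and treating any entry that
starts with a digit as an IP address is an assumption made to keep the sketch
short.

```shell
# Illustration of the matching rule only; not the program's own code.
host_matches() {   # $1 = entry from the configuration file, $2 = host to test
    case "$1" in
        [0-9]*)
            # IP addresses match from the left: the entry is a prefix.
            case "$2" in "$1"*) return 0 ;; esac ;;
        *)
            # Host names match from the right: the entry is a suffix.
            case "$2" in *"$1") return 0 ;; esac ;;
    esac
    return 1
}
```

With this function 'host_matches demon.co.uk www.gedanken.demon.co.uk' succeeds,
as does 'host_matches 192.168. 192.168.1.5'.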


AllowedConnect - A list of hosts that are allowed to connect to the server.

  <host> ; A hostname or IP address that is allowed to connect to the server.

  Notes: The host name matches from the right so a domain name matches all hosts
         in the domain, IP addresses match from the left.
       : All of the hosts in LocalHost are also allowed to connect.


DontCache - A list of ways of recognising a file not to cache.

  host     = <host> ; A hostname or IP address that is not to have its pages
                      cached.
  file-ext = <ext>  ; A file name extension that is not to be cached.
  <host>   = <path> ; A hostname or IP address and the path on it that is not to
                      be cached by the server.

  Notes: The host name matches from the right so a domain name matches all hosts
         in the domain, IP addresses match from the left.
       : Do not include the '.' in the file extension.
       : The path named includes all sub-directories.
       : The files will still be cached if fetched non-interactively.


DontGet - A list of hosts and paths on them that are not to be got by wwwoffled
          (because they contain only junk adverts for example).

  host     = <host> ; A hostname or IP address that is not to have its pages
                      got.
  file-ext = <ext>  ; A file name extension that is not to be got.
  <host>   = <path> ; A hostname or IP address and the path on it that is not to
                      be got by the server.

  Notes: The host name matches from the right so a domain name matches all hosts
         in the domain, IP addresses match from the left.
       : Do not include the '.' in the file extension.
       : The path named includes all sub-directories.


CensorHeader - A list of HTTP header lines that are to be removed from the
               requests sent to web servers.

  <header> ; A header field name, e.g. From, Cookie, User-Agent.

  Notes: The header is case sensitive, and does not have a ':' at the end.


Proxy - This contains the names of the HTTP proxies to use external to the
        local machine.

  default = <host[:port]> ; The hostname and port on it to use as the default
                            proxy for all pages (default=none).
  <host>  = <host[:port]> ; The hostname and port on it to use as the proxy for
                            the hostname or IP address on the left hand side.

  Notes: The host name (on the left) matches from the right so a domain name
         matches all hosts in the domain, IP addresses match from the left.
       : A hostname that matches more than one entry here uses the proxy of
         the longest matching one.
       : You can use none or no hostname to indicate that a default or
         particular host is not to use a proxy.
       : None of the hosts in LocalNet/LocalHost will be fetched via a proxy.
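
The longest match rule is easiest to see with an example.  In this sketch the
file name proxy.example and the two proxy hosts are placeholders.

```shell
# Write an example section into the current directory (hypothetical names).
cat > proxy.example <<'EOF'
Proxy
{
 default     = proxy.example.net:8080
 co.uk       = ukproxy.example.net:3128
 demon.co.uk = none
}
EOF
```

A request for www.gedanken.demon.co.uk matches both co.uk and demon.co.uk; the
longer entry wins, so no proxy is used for it.  Other hosts in co.uk go through
ukproxy.example.net and everything else through the default proxy.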


Purge - The method to determine which pages to purge, the default age, the host
        specific maximum age of the pages in days, and the maximum cache size.

  use-mtime = yes | no ; The method to use to decide which files to purge, last
                         access time (atime) or last modification time (mtime)
                         (default=no).
  default   = <age>    ; The default maximum age of pages in days (default=28).
  <host>    = <age>    ; The maximum age of pages from the named host or IP
                         address in days.
  max-size  = <size>   ; The maximum size for the cache in MB (default=0).

  Notes: The host name matches from the right so a domain name matches all hosts
         in the domain, IP addresses match from the left.
       : A hostname that matches more than one entry here uses the age of the
         longest matching one.
       : An age of zero means not to keep, negative not to delete.
       : A maximum cache size of 0 means there is no limit to the size.
       : When there is a non-zero maximum cache size it is measured excluding
         all hosts with a negative maximum age (never purged hosts).
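
A Purge section that uses each kind of entry might look like the sketch below.
The file name purge.example and the host ads.example.com are placeholders, and
the sizes and ages are arbitrary.

```shell
# Write an example section into the current directory (hypothetical names).
cat > purge.example <<'EOF'
Purge
{
 use-mtime = no
 default   = 28
 max-size  = 50
 www.gedanken.demon.co.uk = -1
 ads.example.com          = 0
}
EOF
```

With these values pages are purged by access time after 28 days, pages from
www.gedanken.demon.co.uk are never deleted, pages from ads.example.com are not
kept, and the 50 MB limit is measured without counting the never purged host.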


Author and Copyright
--------------------

The two programs wwwoffle and wwwoffled were written by Andrew M. Bishop in
1996,97 and are copyright Andrew M. Bishop 1996,97.

They can be freely distributed according to the terms of the GNU General Public
License (see the file `COPYING').

If you wish to submit bug reports or other comments about the programs then
email the author amb@gedanken.demon.co.uk and put wwwoffle in the subject line.


With Source Code Contributions From
- - - - - - - - - - - - - - - - - -

Yannick Versley <sa6z225@public.uni-hamburg.de>
        Initial syslog code (much rewritten before inclusion).

Axel Rasmus Wienberg <2wienbe@informatik.uni-hamburg.de>
        Code to run wwwoffled as a specified uid/gid.

Andreas Dietrich <quasi@baccus.franken.de>
        Code to detach the program from the terminal like a *real* demon.

Ullrich von Bassewitz <uz@wuschel.ibb.schwaben.com>
        Better handling of signals.
        Optimisation of the file handling in the outgoing directory.
        The loglevel, max-servers and max-fetch-servers config options.

And Other Useful Contributions From
- - - - - - - - - - - - - - - - - -

Too many people to mention - (everybody that e-mailed me).
        Suggestions and bug reports.
