Path: news.ruhr-uni-bochum.de!news.rhrz.uni-bonn.de!RRZ.Uni-Koeln.DE!news.gtn.com!blackbush.xlink.net!howland.erols.net!newsfeed.internetmci.com!newsfeed.direct.ca!nntp.teleport.com!usenet
From: sb@sdm.de (Steffen Beyer)
Newsgroups: comp.lang.perl.announce,comp.lang.perl.misc
Subject: ANNOUNCE: generate tree representation of WWW site
Followup-To: comp.lang.perl.misc
Date: 1 Sep 1996 21:39:24 GMT
Organization: sd&m GmbH & Co. KG Munich, Germany
Lines: 112
Approved: merlyn@stonehenge.com (comp.lang.perl.announce)
Message-ID: <50cvqc$e85@nadine.teleport.com>
Reply-To: sb@sdm.de (Steffen Beyer)
NNTP-Posting-Host: kelly.teleport.com
X-Disclaimer: The "Approved" header verifies header information for
    article transmission and does not imply approval of content.
Xref: news.ruhr-uni-bochum.de comp.lang.perl.announce:412 comp.lang.perl.misc:43786

Recently I have written a Perl script to generate a tree representation
of a complete WWW site, or subtrees thereof, which I think might be
useful to others as well.

This is to give the visitors of your web site a useful overview of all
the pages you offer, where they are, and where they have already been.

Please find more details about this script in the following excerpt of
the README file that goes with it!

Please download the script in question from

    http://www.sdm.de/e/www/hilfe/gen_tree-1.1.tar.gz

or

    ftp://..../..../CPAN/authors/id/STBEY/gen_tree-1.1.tar.gz

on any CPAN (= Comprehensive Perl Archive Network) ftp server near you
if you're interested.

(See "The Perl 5 Module List" by Tim Bunce and Andreas Koenig in
news:comp.lang.perl.modules for a list of CPAN ftp servers)

Most important: Enjoy! :-)

Requirements:

Perl version 5.002 or higher.

Compatibility of your web pages with the Apache HTTP server.
(Concerning the syntax of server side includes and server side image
maps)

What does it do:

This script scans the tree (better: the directed graph) of HTML pages
of a web site.
(It's not always a tree because cycles and loops are possible!)

It starts at the home page of that site (called the "root page" here)
and follows all hyperlinks in a recursive descent (breadth first, in
order to produce a representation in the expected way).

(You can also scan just a subtree of your web site if you want)

Since it scans files in the file system of the host serving the web
site, it is confined to pages lying physically on one host (!).

The web server (HTTP daemon) of the web site is NOT used at all (!).

(That's also why it doesn't use the libwww (LWP) module)

Cycles and loops are recognized through unique identification of each
page by the device and inode numbers of its corresponding file.

Therefore, this script is confined to UNIX hosts or hosts where the
device and inode numbers returned by "stat" serve the same purpose as
with UNIX.

One could lift this latter restriction by using checksums for
identification instead. Identification by checksum is not 100%
reliable, however.

When scanning of the web site is complete, an HTML page is generated
which contains all the pages found, in the form of one hyperlink to
each of them.

(The parse tree that is built in memory during the scanning phase is
traversed in a recursive descent, this time depth first, to yield a
tree that looks the expected way.)

The tree structure of the web site is reflected in this page by the
indentation of these hyperlinks.

The text which is displayed in these hyperlinks is extracted from the