Subj : Re: Novice needs help!
To   : comp.programming
From : Phlip
Date : Mon Aug 08 2005 11:33 pm

programmernovice wrote:

> Hi, I'd like to learn to extract information from websites
> automatically; the information is available manually, but it takes a
> long time to extract this way. I know a few basics such as looping,
> "if" statements, etc., but unfortunately learned programming in the
> Fortran days! I know nothing about "object-based languages".

One wonders what information you seek to extract.

If you only need raw text, get a program called "lynx" and use its -dump
option. It converts a web page to raw text, without HTML, and with only
linefeeds for formatting.

If, however, you expect to use the HTML tags to detect the locations of
certain data, you need an HTML parser. Get HttpUnit for Java, or WebUnit
for Ruby, and write "test cases" for the victim^W target web pages.
Here's a sample of HttpUnit doing what you need to the Dilbert web site:

http://www.c2.com/cgi/wiki?HttpUnitTutorial

Now understand that sample targets people already stuck with Java. It is
a very sad, complex language with a difficult learning curve and few
rewards, so try the Ruby version next. Install Ruby, use "gem install
WebUnit" or similar to install WebUnit, and there you go.

Scraping a web site is remarkably similar to testing one, so if you stay
within the web testing community you should easily find more help there.

-- 
Phlip
http://www.greencheese.org/ZeekLand <-- NOT a blog!!!
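
P.S. To give a taste of the "raw text" route without even installing
lynx: the idea is just to drop the tags and keep the text. Here's a
minimal plain-Ruby sketch of that idea (the `strip_html` name and the
regexes are mine, not lynx's actual algorithm, which also handles
layout, links, and more):

```ruby
require "cgi"  # stdlib, for unescaping entities like &amp;

# Crude HTML-to-text conversion, in the spirit of `lynx -dump`:
# drop script/style blocks, turn paragraph breaks into linefeeds,
# strip all remaining tags, then decode entities.
def strip_html(html)
  text = html.gsub(%r{<script.*?</script>}mi, "")
             .gsub(%r{<style.*?</style>}mi, "")
             .gsub(%r{<br\s*/?>|</p>}i, "\n")
             .gsub(/<[^>]+>/, "")
  CGI.unescapeHTML(text).gsub(/[ \t]+/, " ").strip
end

puts strip_html("<html><body><p>Hello, <b>world</b>!</p></body></html>")
# → Hello, world!
```

Good enough for grabbing prose; useless the moment you need to know
*where* in the page a value sat, which is when you reach for a parser.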
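
P.P.S. And a taste of the tag-driven route. I won't reproduce WebUnit's
API from memory here, but the underlying move is the same: use the tag
structure to pin down the data you want. A plain-Ruby sketch (the
`extract_img_srcs` helper and the sample markup are made up for
illustration):

```ruby
# Pull the src attribute out of every <img> tag on a page -- the same
# kind of extraction the HttpUnit Dilbert sample does with a real parser.
def extract_img_srcs(html)
  # scan with one capture group returns [["..."], ["..."]]; flatten it
  html.scan(/<img[^>]+src="([^"]+)"/i).flatten
end

page = '<p>Today: <img src="/comics/strip.gif" alt="comic"></p>'
puts extract_img_srcs(page).first
# → /comics/strip.gif
```

A regex is fine for a one-off scrape of a page whose markup you control
or have inspected; for anything that has to survive the site's redesigns,
a real parser earns its keep.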