Subj : Re: Novice needs help!
To   : comp.programming
From : programmernovice
Date : Tue Aug 09 2005 09:33 am

Phlip wrote:
> programmernovice wrote:
>
> > Hi, I'd like to learn to extract information from websites
> > automatically. The information is available manually, but it takes a
> > long time to extract that way. I know a few basics such as looping,
> > "if" statements, etc., but unfortunately I learned programming in the
> > Fortran days! I know nothing about "object-based languages".
>
> One wonders what information you seek to extract.
>
> If you only need raw text, get a program called "lynx" and use its -dump
> command. It converts a web page to raw text, without HTML, and with only
> linefeeds for formatting.
>
> If, however, you expect to use the HTML codes to detect the locations of
> certain data, you need an HTML parser. Get HttpUnit for Java, or WebUnit for
> Ruby, and write "test cases" for the victim^W target web pages.
>
> Here's a sample of HttpUnit doing what you need to the Dilbert web site:
>
>     http://www.c2.com/cgi/wiki?HttpUnitTutorial
>
> Now understand that sample targets people already stuck with Java. It is a
> very sad, complex language with a difficult learning curve and few rewards,
> so try the Ruby version next. Install Ruby, use "gem install WebUnit" or
> similar to install WebUnit, and there you go.
>
> Scraping a web site is remarkably similar to testing one, so if you stay
> within the web testing community you should easily find more help here.
>
> --
> Phlip

Thanks for replying.
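
To make the HttpUnit suggestion above concrete, a minimal sketch of fetching a
page and dumping its text might look roughly like this. The URL is only a
placeholder, and it assumes the HttpUnit jar is already on the classpath:

    import com.meterware.httpunit.WebConversation;
    import com.meterware.httpunit.WebResponse;

    public class Scrape {
        public static void main(String[] args) throws Exception {
            // Start a "conversation" (HttpUnit's session object) and fetch a page.
            WebConversation wc = new WebConversation();
            WebResponse page = wc.getResponse("http://www.example.com/"); // placeholder URL

            // Dump the page text; real scraping would query the parsed page instead.
            System.out.println(page.getText());
        }
    }

The point of using a parser rather than lynx -dump is that HttpUnit hands back
the parsed page (links, forms, tables), so you can pick out specific pieces of
data instead of working from raw text.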