Subj : Re: Novice needs help!
To   : comp.programming
From : programmernovice
Date : Tue Aug 09 2005 09:33 am

Phlip wrote:
> programmernovice wrote:
>
> > Hi, I'd like to learn to extract information from websites
> > automatically. The information is available manually, but it takes a
> > long time to extract that way. I know a few basics such as looping,
> > "if" statements, etc., but unfortunately I learned programming in the
> > Fortran days! I know nothing about "object-based languages".
>
> One wonders what information you seek to extract.
>
> If you only need raw text, get a program called "lynx" and use its -dump
> command. It converts a web page to raw text, without HTML, and with only
> linefeeds for formatting.
>
> If, however, you expect to use the HTML codes to detect the locations of
> certain data, you need an HTML parser. Get HttpUnit for Java, or WebUnit for
> Ruby, and write "test cases" for the victim^W target web pages.
>
> Here's a sample of HttpUnit doing what you need to the Dilbert web site:
>
>     http://www.c2.com/cgi/wiki?HttpUnitTutorial
>
> Now understand that sample targets people already stuck with Java. It is a
> very sad, complex language with a difficult learning curve and few rewards,
> so try the Ruby version next. Install Ruby, use "gem install WebUnit" or
> similar to install WebUnit, and there you go.
>
> Scraping a web site is remarkably similar to testing one, so if you stay
> within the web testing community you should easily find more help here.
>
> --
> Phlip

Thanks for replying.
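
To make the HttpUnit suggestion above concrete, a minimal sketch of fetching a
page and dumping its text might look roughly like this. The URL is only a
placeholder, and it assumes the HttpUnit jar is already on the classpath:

    import com.meterware.httpunit.WebConversation;
    import com.meterware.httpunit.WebResponse;

    public class Scrape {
        public static void main(String[] args) throws Exception {
            // Start a "conversation" (HttpUnit's session object) and fetch a page.
            WebConversation wc = new WebConversation();
            WebResponse page = wc.getResponse("http://www.example.com/"); // placeholder URL

            // Dump the page text; real scraping would query the parsed page instead.
            System.out.println(page.getText());
        }
    }

The point of using a parser rather than lynx -dump is that HttpUnit hands back
the parsed page (links, forms, tables), so you can pick out specific pieces of
data instead of working from raw text.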