Subj : Re: Novice needs help!
To   : comp.programming
From : Phlip
Date : Mon Aug 08 2005 11:33 pm

programmernovice wrote:

> Hi, I'd like to learn to extract information from websites
> automatically; the information is available manually, but it takes a
> long time to extract this way. I know a few basics such as looping,
> "if" statements, etc., but unfortunately learned programming in the
> Fortran days! I know nothing about "object-based languages".

One wonders what information you seek to extract.

If you only need raw text, get a program called "lynx" and use its -dump
option. It converts a web page to raw text, without HTML, and with only
linefeeds for formatting.

If, however, you expect to use the HTML tags to detect the locations of
certain data, you need an HTML parser. Get HttpUnit for Java, or WebUnit
for Ruby, and write "test cases" for the victim^W target web pages.
Here's a sample of HttpUnit doing what you need to the Dilbert web site:

http://www.c2.com/cgi/wiki?HttpUnitTutorial

Now understand that sample targets people already stuck with Java. It is
a very sad, complex language with a difficult learning curve and few
rewards, so try the Ruby version next. Install Ruby, use "gem install
WebUnit" or similar to install WebUnit, and there you go.

Scraping a web site is remarkably similar to testing one, so if you stay
within the web testing community you should easily find more help there.

-- 
Phlip
http://www.greencheese.org/ZeekLand <-- NOT a blog!!!
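
P.S. To give a taste of the "raw text" route without even installing
lynx: the idea is just to drop the tags and keep the text. Here's a
minimal plain-Ruby sketch of that idea (the `strip_html` name and the
regexes are mine, not lynx's actual algorithm, which also handles
layout, links, and more):

```ruby
require "cgi"  # stdlib, for unescaping entities like &amp;

# Crude HTML-to-text conversion, in the spirit of `lynx -dump`:
# drop script/style blocks, turn paragraph breaks into linefeeds,
# strip all remaining tags, then decode entities.
def strip_html(html)
  text = html.gsub(%r{<script.*?</script>}mi, "")
             .gsub(%r{<style.*?</style>}mi, "")
             .gsub(%r{<br\s*/?>|</p>}i, "\n")
             .gsub(/<[^>]+>/, "")
  CGI.unescapeHTML(text).gsub(/[ \t]+/, " ").strip
end

puts strip_html("<html><body><p>Hello, <b>world</b>!</p></body></html>")
# → Hello, world!
```

Good enough for grabbing prose; useless the moment you need to know
*where* in the page a value sat, which is when you reach for a parser.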
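
P.P.S. And a taste of the tag-driven route. I won't reproduce WebUnit's
API from memory here, but the underlying move is the same: use the tag
structure to pin down the data you want. A plain-Ruby sketch (the
`extract_img_srcs` helper and the sample markup are made up for
illustration):

```ruby
# Pull the src attribute out of every <img> tag on a page -- the same
# kind of extraction the HttpUnit Dilbert sample does with a real parser.
def extract_img_srcs(html)
  # scan with one capture group returns [["..."], ["..."]]; flatten it
  html.scan(/<img[^>]+src="([^"]+)"/i).flatten
end

page = '<p>Today: <img src="/comics/strip.gif" alt="comic"></p>'
puts extract_img_srcs(page).first
# → /comics/strip.gif
```

A regex is fine for a one-off scrape of a page whose markup you control
or have inspected; for anything that has to survive the site's redesigns,
a real parser earns its keep.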