Subj : Re: Get URLs
To   : netscape.public.mozilla.jseng
From : jjl@pobox.com (John J. Lee)
Date : Wed Oct 08 2003 02:43 pm

Brendan Eich writes:

> John J. Lee wrote:
> > Brendan Eich writes:
> >
> >> You can't execute the script on the server side.  It needs the
> > [...]
> > Who said anything about a server?  Did I miss it?
>
> Berurier wrote:
> > Hello,
> >
> > I make a crawler which must *parse* javascript
>
> Crawler, server; potato, potahto.  My point was that neither a crawler
> nor a server is the client JS environment, with a client DOM, events,
> cookies.txt, and a real user sitting there with whom to interact.

We must have rather different definitions of server and client, I think
(and maybe of crawler).  In my mind, it's Crawler, client; tomato,
tomahto.  If you're fetching pages by HTTP with GET and POST, are you
not a client?  That's basically what a crawler does, in my definition of
the word (with the suggestion that you're recursively following links,
and that there's not much interactive UI to speak of -- at least not one
that lets you interact directly with web pages).

> >> you need to run
> >> in the client.  Or you need a proxy between client and server.
> >
> > For which XPCOM is what you want (or other browsers' automation
> > interfaces: MSIE has COM automation interfaces, Konqueror has KParts
> > and DCOP).
>
> I meant by proxy something more like an HTTP proxy.  In other words,
> reverse the sense of the crawl, and wait for URIs (computed however)

Well, we *are* guessing, but it seems an unlikely thing to want to get
the URLs for their own sake.  Most likely he only wants them so he can
immediately fetch the content they point to.

> > There are other ways, too: client libraries like HttpUnit (in Java).
> > Swings and roundabouts.
>
> Right, but the URI(s) to be filtered, if computed based on client
> per-user state or preferences, can't be computed otherwise.  So if the

Perhaps you're tacitly assuming that crawlers won't have such per-user
state.  Frequently, they do.  Even wget, a general-purpose web-scraper,
can handle cookies...

> idea is to deal with any contingent URI, crawling may not be the best
> way to go.

[...]

Cookies can be handled without any knowledge of particular URIs, of
course.

And there do exist many tiny, special-purpose web clients designed to
scrape a particular page for data, which know about the particular forms
they have to fill in, or whatever.  No reason why they can't handle JS
-- and frequently they do.


John
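
P.S.  For concreteness, here's the sort of thing I mean by a client
handling cookies without knowing particular URIs in advance -- a minimal
sketch only, in Python, written with module names from current Python 3
rather than the urllib2 of 2003; the host name and form fields are
invented for the example:

  import urllib.parse
  import urllib.request
  from http.cookiejar import CookieJar

  # One cookie jar, shared by every request made through this opener:
  # cookies are stored and resent automatically, whatever URI we fetch.
  jar = CookieJar()
  opener = urllib.request.build_opener(
      urllib.request.HTTPCookieProcessor(jar))

  # POST a (hypothetical) login form, just as a browser would.
  form = urllib.parse.urlencode({"user": "me", "password": "secret"})
  opener.open("http://www.example.com/login", form.encode("ascii"))

  # Later GETs carry whatever session cookies the server set above,
  # with no URI-specific code on our side.
  page = opener.open("http://www.example.com/members/index.html").read()

The per-user state lives in the client, crawler or not; the jar neither
knows nor cares which URIs the crawl visits.  (wget does much the same
with its --load-cookies and --save-cookies options.)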