Subj : Re: Get URLs
To   : "John J. Lee"
From : Brendan Eich
Date : Tue Oct 07 2003 09:10 pm

John J. Lee wrote:
> Brendan Eich writes:
>
>>You can't execute the script on the server side. It needs the
>
> [...]
>
> Who said anything about a server? Did I miss it?

Berurier wrote:
> Hello,
>
> I make a crawler which must *parse* javascript

Crawler, server; potato, potahto. My point was that neither a crawler
nor a server is the client JS environment, with a client DOM, events,
cookies.txt, and a real user sitting there with whom to interact.

>>you need to run
>>in the client. Or you need a proxy between client and server.
>
> For which XPCOM is what you want (or other browsers' automation
> interfaces: MSIE has COM automation interfaces, Konqueror has KParts
> and DCOP).

By proxy I meant something more like an HTTP proxy. In other words,
reverse the sense of the crawl, and wait for URIs (however computed)
to come to the proxy, rather than attempting a "static" or even
"dynamic" (but offline, crawler-side if not server-side) analysis of
the JS in the hope of discerning URI strings. A sketch of this is
below, after my sig.

> There are other ways, too: client libraries like HttpUnit (in Java).
> Swings and roundabouts.

Right, but if the URI(s) to be filtered are computed from client
per-user state or preferences, they can't be computed any other way.
So if the idea is to cope with any contingent URI, crawling may not be
the best way to go.

But I admit I'm not sure what manulenantais@yahoo.fr wanted to do with
those URI strings found by his crawler. What's more, I'm sorry to say
I confused him with emmanuel.fernandez@crf.canon.fr, and now I'm not
sure whether they're the same person.

So instead of guessing, I'm asking: what is the crawler trying to
discover? The set of all possible URIs that clients may request, given
the JS and HTML content? Or just the ones easily found by static
analysis, or even by "default user" dynamic analysis (running the JS
in the crawler, with default-user cookies and other settings)?

/be
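
P.S. A minimal sketch of the HTTP-proxy idea, in modern Python and
purely for illustration -- none of this is from the thread. The class
name and port are made up, and a real deployment would forward request
headers and handle CONNECT for HTTPS, which this toy does not:

from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

class URILoggingProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        # When a browser talks to a proxy, the request line carries the
        # absolute URI, so self.path is the URI the client actually
        # computed -- including anything its JS built at runtime.
        print(self.path)  # harvest the URI here
        try:
            upstream = urlopen(self.path)  # fetch on the client's behalf
            body = upstream.read()
            self.send_response(upstream.status)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        except Exception:
            self.send_error(502)

if __name__ == "__main__":
    # Point the browser's HTTP proxy setting at localhost:8080 and let
    # a real user (or a driven browser) do the crawling for you.
    HTTPServer(("localhost", 8080), URILoggingProxy).serve_forever()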
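
P.P.S. For contrast, the "static" pass amounts to scanning script text
for URI-shaped string literals, which by construction misses anything
computed at runtime. Again Python, again just a sketch with made-up
names:

import re

# Matches quoted absolute http(s) URIs and quoted root-relative paths.
URI_LITERAL = re.compile(r"""["'](https?://[^"']+|/[^"'\s]*)["']""")

def static_uris(script_text):
    return [m.group(1) for m in URI_LITERAL.finditer(script_text)]

print(static_uris('var u = "http://example.com/page?" + userId;'))
# -> ['http://example.com/page?']
# The per-user part the JS concatenates on is invisible to this pass;
# a "default user" dynamic pass would run the script instead, with a
# stock cookie jar, and capture whatever URI it actually produces.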