Subj : Re: Get URLs
To   : "John J. Lee"
From : Brendan Eich
Date : Wed Oct 08 2003 09:51 am

John J. Lee wrote:
>>Crawler, server; potato, potahto. My point was that neither a crawler
>>nor a server is the client JS environment, with a client DOM, events,
>>cookies.txt, and a real user sitting there with whom to interact.
>
> We must have rather different definitions of server and client, I
> think (and maybe of crawler). In my mind, it's Crawler, client,
> tomato, tomahto.

You're right, of course ;-). I was exaggerating to make a semi-humorous
point. The distinction I wanted to draw is between a bona fide,
UI-driven, human-operated client and a fake or non-client.

> Well, we *are* guessing, but it seems an unlikely thing to want to get
> the URLs for their own sake. Most likely he only wants them so he can
> immediately fetch the content they point to.

I don't want to guess, but the fact remains that, often enough, a
computed URI takes into account some client state the crawler can't
simulate. If we're not worried about computed URIs, then the "grep"
approach wins, and the value of the crawler running JS just to catch
the contrived "http://" + "rest/of/uri" cases is very low.

> Perhaps you have the tacit assumption that the crawler won't have such
> per-user state. Frequently, they do. Even wget, a general-purpose
> web-scraper, can handle cookies...

Sure, but only for the "default" user. I don't want to belabor the
point that computed URIs may depend on client state that varies
unpredictably per user, and even per interactive session.

That may not be a concern for this crawler case. It depends on the
reason the crawler wants the URIs. I wonder what that reason is, still.

/be
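
P.S. For concreteness, here's a minimal Python sketch of the "grep"
approach against a computed URI. The page source, URLs, and the
grep_urls/userId names are invented for illustration, not anything
from this thread; the point is only that a literal scan finds
string-literal URIs but can never see a computed one, whose value
exists only once the script runs in a real client.

    import re

    # Hypothetical page source, invented for illustration (not from
    # the original thread). Note the computed URI in the <script>.
    PAGE = """
    <a href="http://example.com/plain/link">plain</a>
    <script>
      var literal = "http://example.com/literal/in/js";
      // Computed URI: pieced together at runtime, possibly from
      // client state (cookies, session, user input) a crawler
      // can't simulate.
      var computed = "http://" + "example.com/rest/of/uri?u=" + userId;
    </script>
    """

    # The "grep" approach: scan raw source for URL-shaped literals.
    URL_RE = re.compile(r'https?://[^\s"\'<>]+')

    def grep_urls(source):
        """Return URL literals found by pattern-matching raw text."""
        return URL_RE.findall(source)

    for url in grep_urls(PAGE):
        print(url)
    # Prints the href and the literal JS string, but never the
    # computed URI: "http://" by itself doesn't match, and the rest
    # of the value only exists after the script runs in a client.

A crawler that actually runs the JS could catch the computed case, but
only if it also has the right per-user client state in hand, which is
the crux above.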