Subj : Re: Get URLs
To   : netscape.public.mozilla.jseng
From : jjl@pobox.com (John J. Lee)
Date : Wed Oct 08 2003 02:43 pm

Brendan Eich writes:

> John J. Lee wrote:
> > Brendan Eich writes:
> >
> >> You can't execute the script on the server side.  It needs the
> > [...]
> > Who said anything about a server?  Did I miss it?
>
> Berurier wrote:
> > Hello,
> >
> > I make a crawler which must *parse* javascript
>
> Crawler, server; potato, potahto.  My point was that neither a crawler
> nor a server is the client JS environment, with a client DOM, events,
> cookies.txt, and a real user sitting there with whom to interact.

We must have rather different definitions of server and client, I think
(and maybe of crawler).  In my mind, it's Crawler, client; tomato,
tomahto.  If you're fetching pages by HTTP with GET and POST, are you
not a client?  That's basically what a crawler does, in my definition of
the word (with the suggestion that you're recursively following links,
and that there's not much interactive UI to speak of -- at least not one
that lets you interact directly with web pages).

> >> you need to run
> >> in the client.  Or you need a proxy between client and server.
> >
> > For which XPCOM is what you want (or other browsers' automation
> > interfaces: MSIE has COM automation interfaces, Konqueror has KParts
> > and DCOP).
>
> I meant by proxy something more like an HTTP proxy.  In other words,
> reverse the sense of the crawl, and wait for URIs (computed however)

Well, we *are* guessing, but it seems an unlikely thing to want to get
the URLs for their own sake.  Most likely he only wants them so he can
immediately fetch the content they point to.

> > There are other ways, too: client libraries like HttpUnit (in Java).
> > Swings and roundabouts.
>
> Right, but the URI(s) to be filtered, if computed based on client
> per-user state or preferences, can't be computed otherwise.  So if the

Perhaps you're tacitly assuming that crawlers won't have such per-user
state.  Frequently, they do.  Even wget, a general-purpose web-scraper,
can handle cookies...

> idea is to deal with any contingent URI, crawling may not be the best
> way to go.

[...]

Cookies can be handled without any knowledge of particular URIs, of
course.

And there do exist many tiny, special-purpose web clients designed to
scrape a particular page for data, which know about the particular forms
they have to fill in, or whatever.  No reason why they can't handle JS
-- and frequently they do.


John
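
P.S.  For concreteness, here's the sort of thing I mean by a client
handling cookies without knowing particular URIs in advance -- a minimal
sketch only, in Python, written with module names from current Python 3
rather than the urllib2 of 2003; the host name and form fields are
invented for the example:

  import urllib.parse
  import urllib.request
  from http.cookiejar import CookieJar

  # One cookie jar, shared by every request made through this opener:
  # cookies are stored and resent automatically, whatever URI we fetch.
  jar = CookieJar()
  opener = urllib.request.build_opener(
      urllib.request.HTTPCookieProcessor(jar))

  # POST a (hypothetical) login form, just as a browser would.
  form = urllib.parse.urlencode({"user": "me", "password": "secret"})
  opener.open("http://www.example.com/login", form.encode("ascii"))

  # Later GETs carry whatever session cookies the server set above,
  # with no URI-specific code on our side.
  page = opener.open("http://www.example.com/members/index.html").read()

The per-user state lives in the client, crawler or not; the jar neither
knows nor cares which URIs the crawl visits.  (wget does much the same
with its --load-cookies and --save-cookies options.)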