From akriman@darwin.helios.nd.edu Mon Sep 11 02:41:55 2000
Date: Mon, 11 Sep 2000 04:41:49 -0500 (EST)
From: Alfred M Kriman
Message-Id: <200009110941.e8B9fns01267@darwin.helios.nd.edu>
To: classics@u.washington.edu
Subject: Re: Philological reference works on the Web

Daniel Riaño Rufilanchas asked

> Now, one of the best things of on-line databases (as far as the
> accuracy of information is concerned) is the fact that the
> information they contain can be updated almost permanently. But
> precisely that seems to collide with the "stability" requirement of a
> source to be considered as a "authority".
> ...

and R. Laval Hunsucker, in a useful follow-up, commented i.a.

> ... and I'm
> a bit surprised no one's yet chimed in on-list. I myself would very
> much enjoy a lively discussion here (how many weightier problems
> are there at the moment for the business of scholarship in our as
> in all other fields?), but will leave it for the moment at that.

Patrick T. Rourke, for whom this is bread and butter and dessert,
signed his penultimate posting

> PTR, who has probably taxed folks' patience
> enough this week.

In any case, I suspect that PTR is constipated with too much to write
on the topic.

My contribution is finite:

DRR:
> To mention just one issue involved, one of the exigencies of on-line
> information quoting, is giving the exact date (to the minute!) when
> the information was mined, but: how can your reader verify that the
> data you are providing coincide with the data of the on-line site at
> the date it was collected, if the site inform that the data bank was
> updated recently?
> I imagine many of you have dealt with this issue before and maybe
> some of you could give me some idea or solution (or direct me to
> previous threads on this issue!) to the problem: how can you give the
> reader of on-line quotations and references the possibility of always
> consult the state of the referenced on-line authority *as you
> consulted it*?: Is there any program / protocol to store / consult
> the past states of the contents of any piece of information coming
> from on-line resources (typically a Web page)?

Brewster Kahle, who invented the WAIS (Wide Area Information Server)
protocol, sold it for $15 million to AOL long after it was made
substantially obsolete by HTTP. With the money, he started a nonprofit
venture to archive the web. Eventually, this led to the creation of
Alexa [[1]], an AI-ish search-engine component for browsers: ``What's
related'' and other, more sophisticated browsing tools. This was
described in _The Chronicle of Higher Education_, 6 March 1998, in an
article by Jeffrey Selingo [[2]]. Since that time, Alexa has been
bought by Amazon, but continues to provide its data free to the
Internet Archive [[3]], which I guess is the archive component of the
initial venture. As of now, Alexa is the only systematic source of
data for the archive, and it completes a circuit every two months
(crawls are transferred to the archive after six months). It had 14 TB
of material logged as of last March (i.e., material that ought to be
available now). The archive also accepts data donations.
``Like a paper library, we provide free access to researchers,
historians, and scholars.'' Older crawls aren't available yet, and
most of the current uses of the archive seem frivolous, but it seems
promising. Visit the archive.org site for links to papers on the
subject of this thread.

Note that giving the precise moment when you browsed a page is not
enough, for two reasons: (1) you may have browsed a copy from a stale
cache, and (2) even if your download was fresh, it might differ from
pages downloaded in the previous and subsequent crawls of an archiving
spider. The only chance you have of identifying a version that might
be authenticated by an archive is to save the last-modified date.
The last-modified date and time (the HTTP Last-Modified response
header) is provided by the majority of HTTP servers, though browsers
don't normally display the information (they do use it, clumsily, to
check whether locally cached copies are stale). I don't know of anyone
who now does it, but (with a little JavaScript or by murmuring HTTP to
the server via telnet, say) you can get the modification date
information that you need for a precise reference. I presume the same
datum is being preserved with archive copies (if only because the
spider needs the information to function efficiently and not recopy
unchanged files).

My own impression from hit statistics is that the Alexa spiders are
crawling much less vigorously than they used to (or keeping up less
well). [A look at Alexa's most-popular-sites list also suggests that
Alexa now has a somewhat unrepresentative user base concentrated in
(R. of) Korea and Japan. However, the popularity rankings and archive
circuit are apparently substantially independent, so this shouldn't
matter.] They stopped archiving images at the end of 1998. It's hard
to keep up. Data on dynamic pages must be a problem too.

The search engine google.com makes easily available the cached copies
of pages it has indexed (with images).
However, this is of almost no use archivally, because they crawl
frequently and apparently discard older versions.

There are lots of spiders I don't recognize, and any number of them
might be archiving for long-term documentary purposes. It's hard to
tell who and how many are watching. Case in point: although there are
only two publicly accessible archives for this (classics) list, I am
aware of at least half-a-dozen local (personal, course-related, or
campus) archives or mirrors of the list, with no reason to doubt that
there are many others.

[[1]] http://www.alexa.com
[[2]] http://chronicle.com/data/articles.dir/art-44.dir/issue-26.dir/26a02701.htm
[[3]] http://www.archive.org/
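
A minimal sketch, in Python, of the last-modified lookup suggested above: ask the server directly with an HTTP HEAD request, read the Last-Modified header, and record it next to the retrieval time in the reference. The function names (fetch_last_modified, parse_http_date, citation_stamp) and the wording of the citation string are illustrative assumptions, not an established convention, and not every server sends the header (dynamic pages often omit it).

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional
from urllib.request import Request, urlopen


def fetch_last_modified(url: str, timeout: float = 10.0) -> Optional[str]:
    """Send a HEAD request and return the raw Last-Modified header, or None.

    Hypothetical helper: needs network access, and the server may simply
    not supply the header.
    """
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return response.headers.get("Last-Modified")


def parse_http_date(value: str) -> datetime:
    """Parse an HTTP date such as 'Mon, 11 Sep 2000 09:41:49 GMT' into UTC."""
    parsed = parsedate_to_datetime(value)
    if parsed.tzinfo is None:
        # HTTP dates are defined in GMT; treat a zone-less result as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def citation_stamp(url: str, last_modified: Optional[str],
                   retrieved: datetime) -> str:
    """Build a reference string recording both dates (format is an assumption)."""
    if last_modified is None:
        modified_part = "no Last-Modified header supplied"
    else:
        modified_part = "last modified " + parse_http_date(last_modified).isoformat()
    return f"{url} ({modified_part}; retrieved {retrieved.isoformat()})"
```

In use, one would call citation_stamp(url, fetch_last_modified(url), datetime.now(timezone.utc)) at the moment of consultation and keep the resulting string with the quotation. Whether an archive copy can later be matched against that date depends on the archive preserving the same header, which the discussion above only presumes.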