From hunsucker@uba.uva.nl Mon Sep 11 06:50:00 2000
Received: from mxu4.u.washington.edu (mxu4.u.washington.edu [140.142.33.8]) by lists.u.washington.edu (8.9.3+UW00.05/8.9.3+UW99.09) with ESMTP id GAA150366 for ; Mon, 11 Sep 2000 06:49:59 -0700
Received: from barlaeus.ic.uva.nl (barlaeus.ic.uva.nl [145.18.68.50]) by mxu4.u.washington.edu (8.9.3+UW00.02/8.9.3+UW99.09) with ESMTP id GAA05769 for ; Mon, 11 Sep 2000 06:49:58 -0700
Received: from S879.uba.uva.nl (L-Hunsucker.uba.uva.nl [145.18.84.178]) by barlaeus.ic.uva.nl (8.9.3/8.9.3) with SMTP id PAA27124 for ; Mon, 11 Sep 2000 15:49:57 +0200 (MET DST)
X-Authentication-Warning: barlaeus.ic.uva.nl: Host L-Hunsucker.uba.uva.nl [145.18.84.178] claimed to be S879.uba.uva.nl
Message-Id: <3.0.6.32.20000911155458.00964bd0@mail.uba.uva.nl>
X-Sender: hunsucke@mail.uba.uva.nl
X-Mailer: QUALCOMM Windows Eudora Light Version 3.0.6 (32)
Date: Mon, 11 Sep 2000 15:54:58 +0200
To: classics@u.washington.edu
From: "R.L. Hunsucker (UvA/UBA)"
Subject: Re: Philological reference works on the Web
In-Reply-To: <200009110941.e8B9fns01267@darwin.helios.nd.edu>
Mime-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

I read with interest the informative contribution by AMK, but continue
undiminishedly to wonder whether the answer to Daniel Riaño Rufilanchas'
dilemma (a dilemma for all of us, in fact) doesn't lie elsewhere than in
the reactive and neutral efforts, past and future, to crawl, store, and
perhaps also index the whole free-for-all that the Internet is.

Isn't what we need a more discriminating and maybe even normative
approach? And among the parties that have to be involved in establishing
and maintaining such an approach are certainly those two (actually one
and the same) which Daniel had in mind: the scholar/researcher/teacher
etc. as (user and) citer, and the scholar/researcher/teacher etc. as
verifier (and user) of documents offered (only) in the network
environment.

And it isn't necessarily so that the responsibility and effort must
reside at the output (and use) end, rather than at least partially with
the supply side -- and/or in the intermediation mechanism between the
two.

Furthermore, isn't it the case that much of just the sort of resource
that Daniel had in mind, proprietary and protected as it is, lies out of
the reach of your typical bots out there on the Web?

If only it were so simple. We're gonna need more than Moore's Law, I'm
afraid.

- - - - - - - - - - - - - - - - -
Laval Hunsucker
hunsucker@uba.uva.nl
------------------------------

####################

At 04:41 11-9-00 -0500, Alfred M. Kriman wrote:
>[ . . . ]
>Brewster Kahle, who invented the WAIS (Wide Area Information Server)
>protocol, sold it for $15 million to AOL long after it was made
>substantially obsolete by HTTP. With the money, he started a nonprofit
>venture to archive the web. Eventually, this led to the creation of an
>AI-ish search-engine component for browsers -- ``What's related'' and
>other, more sophisticated browsing tools called Alexa [[1]]. This was
>described in _The Chronicle of Higher Education_, 6 March 1998, in an
>article by Jeffrey Selingo [[2]]. Since that time, Alexa has been
>bought by Amazon, but continues to provide its data free to the Internet
>Archive [[3]], which I guess is the archive component of the initial
>venture. As of now, Alexa is the only systematic source of data to the
>archive, and it completes a circuit every two months (crawls are
>transferred to the archive after six months). It had 14 TB of material
>logged as of last March (i.e., that ought to be available now). The
>archive also accepts data donations.
>
>``Like a paper library, we provide free access to researchers,
>historians, and scholars.'' Older crawls aren't available yet, and
>most of the current uses of the archive seem frivolous, but it seems
>promising. Visit the archive.org site for links to papers on the
>subject of this thread.
>
>Note that giving the precise moment when you browsed a page is not enough
>for two reasons: (1) you may have browsed a copy from a stale cache, and
>(2) even if your download was fresh, it might differ from pages downloaded
>in the previous and subsequent crawls of an archiving spider. The only
>chance you have of identifying a version that might be authenticated by
>an archive is to save the last-modified-date.
>
>Last-modified-date (and time) is provided by the majority of http
>servers, though browsers don't normally display the information (they do
>use it, clumsily, to check whether locally cached copies are stale). I
>don't know of anyone who now does it, but (with a little javascripting
>or by murmuring http to the server via telnet, say) you can get the
>modification date information that you need for a precise reference.
>I presume the same datum is being preserved with archive copies (if
>only because the spider needs the information to function efficiently
>and not recopy unchanged files).
>
>My own impression from hit statistics is that the Alexa spiders are
>crawling much less vigorously than they used to (or keeping up less
>well). [A look at Alexa's most-popular-sites list also suggests that
>Alexa now has a somewhat unrepresentative user base concentrated in
>(R. of) Korea and Japan. However, the popularity rankings and archive
>circuit are apparently substantially independent, so this shouldn't
>matter.] They stopped archiving images at the end of 1998. It's hard
>to keep up. Data on dynamic pages must be a problem too.
>
>The search engine google.com makes easily available the cached copies
>of pages it has indexed (with images). However, this is of almost no
>use archivally, because they crawl frequently and apparently discard
>older versions.
>
>There are lots of spiders I don't recognize, and any number of them
>might be archiving for long-term documentary purposes. It's hard to
>tell who and how many are watching. Case in point: although there
>are only two publicly accessible archives for this (classics) list,
>I am aware of at least half-a-dozen local (personal, course-related,
>or campus) archives or mirrors of the list, with no reason to doubt
>that there are many others.
>
>[[1]]
>http://www.alexa.com
>
>[[2]]
>http://chronicle.com/data/articles.dir/art-44.dir/issue-26.dir/26a02701.htm
>
>[[3]]
>http://www.archive.org/
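
For anyone who wants to try the "murmuring http to the server via telnet" route mentioned in the quoted message, here is a minimal sketch in Python of what that might look like. It is only an illustration, not something from the original thread: the function names (build_head_request, last_modified_from_response, fetch_last_modified) and the host www.example.org are hypothetical, and it assumes the server actually sends a Last-Modified header, which not every server does (dynamic pages often omit it).

```python
from email.utils import parsedate_to_datetime
import http.client


def build_head_request(host, path="/"):
    """Return the text one would type into a telnet session on port 80.

    A HEAD request asks for the response headers only, not the page body.
    """
    return f"HEAD {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"


def last_modified_from_response(raw_response):
    """Pull the Last-Modified value out of raw response-header text.

    Returns a datetime, or None if the server did not send the header.
    """
    # Skip the status line; stop at the blank line that ends the headers.
    for line in raw_response.split("\r\n")[1:]:
        if not line:
            break
        name, _, value = line.partition(":")
        if name.strip().lower() == "last-modified":
            return parsedate_to_datetime(value.strip())
    return None


def fetch_last_modified(host, path="/"):
    """Do the same thing over a live connection (needs network access)."""
    conn = http.client.HTTPConnection(host, timeout=10)
    try:
        conn.request("HEAD", path)
        value = conn.getresponse().getheader("Last-Modified")
        return parsedate_to_datetime(value) if value else None
    finally:
        conn.close()


if __name__ == "__main__":
    # Hypothetical host, used purely for illustration.
    print(build_head_request("www.example.org", "/index.html"))
```

The date recovered this way, cited alongside the URL, is the datum the quoted message suggests saving; whether a given archive keeps the same value with its copies is, as noted there, only a presumption.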