Newsgroups: news.software.b
Path: utzoo!sq!lee
From: lee@sq.sq.com (Liam R. E. Quin)
Subject: Re: Modifying news storage for fast searches
Message-ID: <1989Nov26.032124.680@sq.sq.com>
Reply-To: lee@sq.com (Liam R. E. Quin)
Organization: Unixsys (UK) Ltd
References: <51195@looking.on.ca> <2179@prune.bbn.com>
Date: Sun, 26 Nov 89 03:21:24 GMT

In article <2179@prune.bbn.com> rsalz@bbn.com (Rich Salz) writes:
>In <51195@looking.on.ca> brad@looking.on.ca (Brad Templeton) writes:
>>Well, if news.software.b is any indication, a fast search system for
>>News might not be too bad.
Do you mean fast for users or for programs, I wonder...

>>Another idea is to store articles in a special compressed form that lists
>>the dictionary first (ie. the list of words) followed by the text expressed
>>and indices into the word list.
This turns out to have problems.
The average word-length in English is about 4.5 characters, plus 1 or more
for the space between words.  So you have to use less than 4.5 bytes to
store each word, and still retain the punctuation (which is very important
in fragments of C code, of course!).
Now, there are an estimated million users, each with a distinct host and
username.  So the vocabulary would be a little large.  And if peeple make
as many speling mistakes and typos as I doo [:-)], the vocabulary gets even
larger.  Most "words" are used less than ten times!
So, you would probably need 4 bytes-worth of word-number, and the saving turns
out to be less than you would like.
See below for some other approaches, though.
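
To make the arithmetic concrete, here is a rough sketch that estimates the
size of a dictionary-plus-indices encoding.  The function name, the tokenizer
and the 4-byte index width are my assumptions for illustration, not anything
from the original proposal.

```python
import re

def encoded_size(text, index_bytes=4):
    """Estimate the bytes needed for a dictionary-plus-indices encoding."""
    # Split into word and non-word runs so punctuation and spacing survive.
    tokens = re.findall(r"\w+|\W+", text)
    vocabulary = set(tokens)
    # Dictionary: each distinct token plus a one-byte terminator.
    dictionary_bytes = sum(len(t.encode()) + 1 for t in vocabulary)
    # Body: one fixed-width index per token.
    body_bytes = index_bytes * len(tokens)
    return dictionary_bytes + body_bytes

sample = "the cat sat on the mat; the dog sat on the log."
print(len(sample.encode()), encoded_size(sample))
```

On a short sample the encoded form is larger than the raw text.  Treating
every separator as its own token doubles the token count here; a real scheme
would presumably fold the common single space into the word index.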

>Free-text retrieval is basically a solved problem.  Go buy books by
>(Gerald?) Salton.
I should point out that there is still a lot of research in this area.

>Check out your Unix documentation for "Some Examples of Inverted Indices
>on the Unix System" by Mike Lesk (USD:30 in the BSD docs, I don't know
>where for other systems -- 2B for Version 7, I think).
Refer is not distributed with System V.

>There was a mini text-retrieval system that appeared in comp.sources.misc
>qndxr I think the name was.  There will be a bigger system in c.s.unix in
>a couple of weeks.
Well, as soon as I finish the Sun port.  It is ready now on 386/ix, except
that I need to take the Sun version and check that it still works on the 386.
MIPS and Cray patches to follow :-) :-) :-)

I have not been able to try it on news because of disk space, however.
If anyone cares to donate an eagle or three, I will make a news-reader
interface.  [0.5 :-)]

>Associative retrieval -- "give me more articles like THIS one" was first
>proposed in the 1950's.  Thinking Machines has one hell of a sexy demo [...]
This is rather tricky.  TOPIC is a commercial package which does this.  It can
cost of the order of UK#250,000 for a typical 64-user Sequent Symmetry.

The difficulty is in finding unusual words and phrases.
If you ask for articles containing "unix" anywhere in the body, for instance,
you will get an awful lot of responses!
There are also disk space issues.  Most articles expire after less than a day
or so here as it is.
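
One way to picture "finding unusual words" is to rank the words of an article
by how many articles in the collection contain them, and keep the rarest.
This is only a sketch under my own assumptions (whitespace tokenizing, a
hypothetical rare_terms helper); it is not how TOPIC or any other product
works.

```python
from collections import Counter

def rare_terms(article, corpus, k=3):
    """Return the k words of `article` found in the fewest corpus articles."""
    doc_freq = Counter()
    for text in corpus:
        # Count each word once per article, however often it repeats.
        doc_freq.update(set(text.lower().split()))
    words = set(article.lower().split())
    # Rarest first; ties are broken alphabetically for a stable result.
    return sorted(words, key=lambda w: (doc_freq[w], w))[:k]

corpus = [
    "unix news software",
    "unix grep refer",
    "unix signatures hashing",
]
print(rare_terms(corpus[1], corpus, k=2))
```

A word like "unix" appears in every article and so sorts last, which is why
it makes a poor search term on its own.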

>To follow the Usenet trend of "I said it first," I guess I should say
>that I proposed this on the news-interfaces list nearly a year ago.
>	/r$
Well, what happened to "knews" back in 1985 or so?

* Footnote: Some other approaches to retrieval

(1) signatures -- this is basically hashing, and can dramatically reduce
   search time.  There is generally a 10% space overhead.
   Most current schemes (PAT et al.) do not cope with deleting and adding
   files at 10 minute intervals.  PAT takes a weekend on a Sun 4 to index
   one ~500 Meg file, which would be a little irritating for six months'
   news.

(2) refer uses something like superimposed signatures, see Mike Lesk's paper.
   But it turns out that grep technology has caught up with refer now.

(3) Fulcrum use a byte index, which allows *much* more sophisticated
    searching.  Unfortunately, this algorithm is not (as I understand it)
    publicly available.
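
As an illustration of the signature idea in (1), here is a minimal sketch of
superimposed coding: each word sets a few bits in a fixed-width bit mask per
article, and a query is checked against the masks before any text is read.
The signature width, the bits per word, the use of MD5 and all function names
are my assumptions; this is not the algorithm used by PAT, refer or Fulcrum.

```python
import hashlib

SIG_BITS = 256      # signature width per article (assumed)
BITS_PER_WORD = 3   # bits set for each word (assumed)

def word_mask(word):
    """Hash one word to a mask with up to BITS_PER_WORD bits set."""
    digest = hashlib.md5(word.lower().encode()).digest()
    mask = 0
    for i in range(BITS_PER_WORD):
        pos = int.from_bytes(digest[2 * i:2 * i + 2], "big") % SIG_BITS
        mask |= 1 << pos
    return mask

def signature(text):
    """OR together the masks of every word in the text."""
    sig = 0
    for word in text.split():
        sig |= word_mask(word)
    return sig

def search(articles, sigs, query):
    """Return the names of articles containing every query word."""
    wanted = signature(query)
    words = query.lower().split()
    hits = []
    for name, sig in sigs.items():
        if (sig & wanted) != wanted:
            continue  # a needed bit is missing: certainly not a match
        # The signature can give false positives, so check the real text.
        have = set(articles[name].lower().split())
        if all(w in have for w in words):
            hits.append(name)
    return hits

articles = {
    "a": "refer builds inverted indices",
    "b": "grep scans the whole file",
}
sigs = {name: signature(text) for name, text in articles.items()}
print(search(articles, sigs, "inverted indices"))
```

Adding or expiring an article only touches its own signature, which is the
property one would want for news; the cost is the false positives that the
final text check has to weed out.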

If the original poster wants to discuss the algorithms I chose in implementing
a fairly simple prototype text retrieval package, I would be happy to do so
in mail.

Lee
-- 
Liam R. Quin, Unixsys (UK) Ltd [note: not an employee of "sq" - a visitor!]
lee@sq.com (Whilst visiting Canada from England, until Christmas)
utai!anduk.uucp!lee (after Christmas)
 ...striving to promote the interproduction of epimorphistic conformability
