README freeWAIS-0.2-sf

Authors 
   Thinking Machines, Jim Fullton, Kevin Gamiel, Jane Smith, Tung Huynh, 
   Ulrich Pfeifer 
Last update 
   27.6.1994 
Abstract 
   This file contains overview information about freeWAIS 0.2 sf 
Send comments and bug fixes to: 
   Huynh Quoc Thanh Tung <huynh1@ls6.informatik.uni-dortmund.de> 
For installation problems, send mail to: 
   Ulrich Pfeifer <pfeifer@ls6.informatik.uni-dortmund.de> 

Notes on the beta version:

   All tests should succeed on: 

     Machine       Operating system     Compiler
     -------       ----------------     --------
     Linux         Linux 1.0.8.(Posix)  gcc version 2.5.8
     SUN SPARC     SunOS Release 4.1.3  gcc version 2.4.3
     DEC alpha     DEC OSF/1 V1.2       DEC OSF/1 C Compiler
     IBM RS6000    IBM-RS/6000-580 AIX  IBM AIX XL C Compiler/6000
     Solaris       Solaris 2.3          gcc version 2.5.6

   A new term weighting formula, which should improve retrieval quality (see
   TERM WEIGHTING). 
   If a file does not exist, the indexer returns an error message instead of
   dumping core. 

I'd like to thank Alf-Christian Achilles <achilles@ira.uka.de> for providing the
regexp library and very helpful comments. More information is in the History file. 

Known Bugs

 1. missing Files 
 2. DICTIONARY_TOTAL_SIZE_WORD 
 3. Number of Blocks in Dictionary 
 4. Headlines (Fri Jun 3 14:09) 
 5. strdup missing 
 6. better support for doctype URL (Fri Jun 10 13:50) 
 7. strncmp for 8-bit chars 
 8. missing install.sh 
 9. missing parameter field_id 
 10. catalog (file.cat) is empty 
 11. stemming by field searching 
 12. bug by creating more than 9 fields 
 13. bug by numeric searching with the operator ">" 

   The distribution now (Tue Jul 5) contains fixes for these bugs.

 14. bug with synonym file 
 15. bug with synonym file (free Error) 

freeWAIS-sf and gopher

It is possible to link freeWAIS-sf into gopher. Thanks to Svein Parnas
<Svein.Parnas@rbt.no> for advice. 

Introduction

Previously, it was not possible to restrict a search to a field, e.g. ti = information
or au = martin. This freeWAIS-0.2-sf version supports fields and numeric searching. 

Because documents do not share an overall structure, you must define a document
specification. It describes how your documents are indexed, how fields are
generated for them, and what layout the document headlines have. Lines of a
document that are not covered by the document specification are ignored. 

For each field you create, a dictionary file and an inverted file are generated. The
fields are listed in <database>.src. If you want to describe the fields, you can create a
file named <database>.fde, e.g.: 

<database>.fde:
               py: publication year
               au: author
               ti: title
               jt: journal title
               ck: citation key

By default the fields are inserted in <database>.src without description. 
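
As an illustration only, the following Python sketch reads field descriptions in
the "name: description" layout shown above into a dictionary. It assumes one
field per line. parse_fde is a hypothetical helper and is not part of
freeWAIS-sf; how the indexer itself reads the .fde file is not described here.

```python
def parse_fde(text):
    """Map field names to descriptions from 'name: description' lines."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue  # skip blank or malformed lines (assumed behaviour)
        name, description = line.split(":", 1)
        fields[name.strip()] = description.strip()
    return fields


sample = """\
py: publication year
au: author
ti: title
jt: journal title
ck: citation key
"""
descriptions = parse_fde(sample)
```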

What is the document specification?

DOCUMENT SPEC SYNTAX


   +------------------------------------------------------------+
   |  format      -> <record-end> regexp speclist               |
   |  speclist    -> spec | spec speclist                       |
   |  spec        -> <field> REGEXP regexp                      |
   |                         fieldlist                          |
   |                         options                            |
   |                         indexspecs                         |
   |                 <end>   regexp                             |
   |  options     -> '' | option options                        |
   |  option      -> NUMERIC regexp INT                         |
   |                 HEADLINE regexp INT                        |
   |                 DATE REGEXP REGEXP date date date regexp   |
   |  indexspecs  -> '' | indexspec indexspecs                  |
   |  indexspec   -> indextype dicts                            |
   |  indextype   -> TEXT | SOUNDEX | PHONIX                    |
   |  dicts       -> GLOBAL | LOCAL | BOTH                      |
   |  date        -> DAY | MONTH monthspec | YEAR               |
   |  monthspec   -> '' | STRING                                |
   |  fieldlist   -> '' | WORD fieldlist                        |
   +------------------------------------------------------------+


REGULAR EXPRESSION SYNTAX

        +--------------------------------------------------+
        | Operator        Meaning                          |
        |--------------------------------------------------|
        | x               the character "x"                |
        | "x"             an "x", even if x is an operator |
        | \x              an "x", even if x is an operator |
        | [xy]            the character x or y             |
        | [x-z]           the characters x, y or z         |
        | [^x]            any character but x              |
        | .               any character but newline        |
        | ^x              an x at the beginning of a line  |
        | x$              an x at the end of a line        |
        | x?              an optional x                    |
        | x*              0,1,2, ... instances of x        |
        | x+              1,2,3, ... instances of x        |
        | x|y             an x or a y                      |
        | (x)             an x                             |
        | x{m,n}          m through n occurrences of x     |
        +--------------------------------------------------+

The document specification is easiest to explain with an example: 

DOCUMENT SPEC EXAMPLE

CK: Mostert/etal:89
AU: Mostert, D.N.J.; Eloff, J.H.P.; von Solms, S.H.
TI: A Methodology for Measuring User Satisfaction.
JT: Information processing & management.
ED: JAN-01-1994
VO: 25
PY: 1989
NO: 5
PP: 545
^L
CK: Qiu:90
AU: Qiu, Liwen
TI: An Empirical Examination of the Existing Models for Bradford's Law.
JT: Information processing & management.
ED: JAN-01-1994
VO: 26
PY: 1990
NO: 5
PP: 655
^L



------------------------------------------------+---------------------------------------
<record-end> /^L/                               | records are separated by form feeds
                                                | (Ctrl-L, not the two characters '^L'!)
                                                *
<layout>                                        | 
<headline> /^TI: / /^[A-Z][A-Z]:/ 50 /TI: /     | line which starts with 'TI: '
                                                | and ends with /^[A-Z][A-Z]:/
                                                | first 50 chars after 'TI: ' are copied
                                                | to the chars 1 to 50 of the headline.
                                                * 
<headline> /^AU: / /^[A-Z][A-Z]:/ 50 /AU: /     | line which starts with 'AU: '
                                                | and ends with /^[A-Z][A-Z]:/
                                                | first 50 chars after 'AU: ' are copied
                                                | to chars 51 to 100 of the headline.
                                                *
<date> /^ED: / /%s-%d-%d/ month string day year | line starts with /^ED: /
/^ED: [^ ]/                                     | /%s-%d-%d/ is the sscanf argument.
                                                | The month is given as a string
                                                | (a number by default if you omit 'string');
                                                | after the month come the day, then the year.
                                                | /^ED: [^ ]/ marks where indexing begins.
<end>                                           | end of layout.
                                                *
<field> /^PY: /                                 | field 'py' is a numeric field of length 4.
py <numeric> /^PY: [^ ]/ 4 TEXT LOCAL           | The number begins at the first non-blank
<end> /^[A-Z][A-Z]:/                            | character after 'PY: ', as selected by
                                                | /^PY: [^ ]/, e.g. at the '1' of 1990
                                                | (begin of line by default). The field is
                                                | indexed with type TEXT in the local
                                                | dictionary only and ends with the
                                                | next tag.
                                                *
<field> /^AU: /                                 | field 'au' is indexed with types
au SOUNDEX LOCAL TEXT LOCAL                     | TEXT and SOUNDEX in the local
<end> /^[A-Z][A-Z]:/                            | dictionary.
                                                *
<field> /^CK: /                                 | field 'ck' is indexed with type text
ck TEXT BOTH                                    | in the local and the global dict.
<end> /^[A-Z][A-Z]:/                            |
                                                *
<field> /^TI: /                                 | field 'ti' is indexed with type text
ti stemming TEXT BOTH                           | in the local and the global dict.
<end> /^[A-Z][A-Z]:/                            | 'stemming' indicates that the stemmer is
                                                | applied to this field (no stemming by default).
                                                *
<field> /^AU: /                                 | field 'au' is indexed with type text 
au TEXT BOTH                                    | in the local and the global dict.
<end> /^[A-Z][A-Z]:/                            |
                                                *
<field> /^JT: / /^JT: [^ ]/                     | fields 'ti' and 'jt' are indexed with type 
ti jt TEXT BOTH                                 | text in the local and the global dict.
<end> /^[A-Z][A-Z]:/                            | Indexing begins at the position selected by
                                                | the regexp /^JT: [^ ]/ (optional,
                                                | begin of line by default), e.g. 
                                                | JT: Information processing & management.
                                                |     ^ indexing begins here.
                                                *
<field> /^AU: /                                 | line which begins with the regexp /^AU: /
TEXT GLOBAL                                     | should be indexed only in global dictionary.
<end> /^[A-Z][A-Z]:/                            |
------------------------------------------------+----------------------------------------------
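
To make the roles of <record-end>, <field> and <end> concrete, here is a small
Python sketch. It is not the indexer's code: split_records and extract_field
are hypothetical helpers, the sample is an abridged copy of the two records
above, and Python's re module stands in for the regexp library used by
freeWAIS-sf.

```python
import re

TAG = re.compile(r"^[A-Z][A-Z]:")  # the <end> regexp used in the spec above


def split_records(text):
    """Split a file into records at form feeds (Ctrl-L)."""
    return [r.strip("\n") for r in text.split("\f") if r.strip()]


def extract_field(record, start):
    """Collect the text opened by the regexp `start` up to the next tag."""
    start_re = re.compile(start)
    collected, inside = [], False
    for line in record.splitlines():
        if start_re.search(line):
            inside = True
            collected.append(start_re.sub("", line, count=1))
        elif inside and TAG.search(line):
            inside = False  # the next tag ends the field
        elif inside:
            collected.append(line)  # continuation line of the field
    return " ".join(part.strip() for part in collected)


sample = (
    "CK: Mostert/etal:89\n"
    "AU: Mostert, D.N.J.; Eloff, J.H.P.; von Solms, S.H.\n"
    "TI: A Methodology for Measuring User Satisfaction.\n"
    "PY: 1989\n"
    "\f\n"
    "CK: Qiu:90\n"
    "AU: Qiu, Liwen\n"
    "TI: An Empirical Examination of the Existing Models for Bradford's Law.\n"
    "PY: 1990\n"
    "\f\n"
)

records = split_records(sample)
years = [int(extract_field(r, r"^PY: ")) for r in records]
```

Here years holds the numeric 'py' values of the two records, and
extract_field(records[1], r"^AU: ") returns the text that a field 'au' would
index for the second record.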

Equivalent specification (with old headline definitions):

-----------------------------------------+------------------------------
<record-end> /^L/                        | records are separated by form feeds
                                         | (Ctrl-L, not the two characters '^L'!)
<field> /^PY: /                          | field 'py' starts with 'PY:'.
py <numeric> /^PY: [^ ]/ 4 TEXT LOCAL    | It is a numeric field of length 4.
                                         | The number begins at the first non-blank
                                         | character after 'PY: ', as selected by
                                         | /^PY: [^ ]/, e.g. at the '1' of 1990
                                         | (begin of line by default). The field is
<end> /^[A-Z][A-Z]:/                     | indexed with type TEXT in the local
                                         | dictionary only and ends with the 
                                         | next tag.
                                         |
<field> /^AU: /                          | field 'au' is indexed with types
au SOUNDEX LOCAL TEXT LOCAL              | TEXT and SOUNDEX in the local
<end> /^[A-Z][A-Z]:/                     | dictionary.
                                         |
<field> /^CK: /                          | field 'ck' is indexed with type text
ck TEXT BOTH                             | in the local and the global dict.
<end> /^[A-Z][A-Z]:/                     |
                                         |
<field> /^TI: /                          | first 50 chars after 'TI: ' are copied
ti <headline> /TI: / 50                  | to the chars 1 to 50 of the headline.
<date> /^ED: / /%s-%d-%d/                | The headlines should have a date. In this
month string day year /^ED: [^ ]/        | example a date is JAN-01-1994. The month is
stemming TEXT BOTH                       | a string (a number by default if you omit
<end> /^[A-Z][A-Z]:/                     | 'string'); after the month come day, then year.
                                         | /%s-%d-%d/ is the sscanf argument. /^ED: [^ ]/
                                         | marks where indexing begins. 'stemming'
                                         | indicates that the stemmer is applied to
                                         | this field (no stemming by default).
                                         | 
<field> /^AU: /                          | first 50 chars after 'AU: '
au <headline> /AU: / 50 TEXT BOTH        | are copied to chars 51 to 
<end> /^[A-Z][A-Z]:/                     | 100 of the headline.
                                         |
<field> /^JT: / /^JT: [^ ]/              | fields 'ti' and 'jt' are indexed with type 
ti jt TEXT BOTH                          | text in the local and the global dict.
<end> /^[A-Z][A-Z]:/                     | Indexing begins at the position selected by
                                         | the regexp /^JT: [^ ]/ (optional, 
                                         | begin of line by default), e.g. 
                                         | JT: Information processing & management.
                                         |     ^ indexing begins here.
<field> /^AU: /                          |
TEXT GLOBAL                              | line which begins with the regexp /^AU: /
<end> /^[A-Z][A-Z]:/                     | should be indexed only in global dictionary.
                                         |
-----------------------------------------+------------------------------
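
The headline layout described in both tables (50 characters of the title in
columns 1 to 50, 50 characters of the author in columns 51 to 100) can be
pictured with the following Python sketch. build_headline is a hypothetical
helper, and padding short parts with blanks is an assumption; the indexer may
fill unused columns differently.

```python
MAX_HEADER_LEN = 100  # default headline length in this version


def build_headline(title, author, width=50):
    """Put the first `width` chars of the title in columns 1..50 and the
    first `width` chars of the author in columns 51..100."""
    # Padding with blanks is an assumption made for this sketch.
    headline = title[:width].ljust(width) + author[:width].ljust(width)
    return headline[:MAX_HEADER_LEN]


headline = build_headline(
    "An Empirical Examination of the Existing Models for Bradford's Law.",
    "Qiu, Liwen",
)
```

The title is longer than 50 characters, so it is cut off; the author part
starts at column 51 regardless of the title length.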

NOTE

   Control characters are written as follows: \A = ^A (Ctrl-A), \B = ^B, ..., \J = '\n'
   (newline). 
   If the end_tag is at the end of a line, you must put a newline character '\n' into
   the regexp for the end_tag. 

Example

Document specification for 

        <TI:> Information Retrieval <TI:>
   
        <field> /^<TI:>/
        ti TEXT LOCAL
        <end> /$<TI:>\n/

   If the separator is an empty line, the regexp for it is /^\n$/. 
   The length of a headline is 100 characters. If you want to change the length of
   the headline, update MAX_HEADER_LEN (default MAX_HEADER_LEN =
   100) in ircfiles.c. 
   There are 3 index options: LOCAL, GLOBAL, BOTH 

   LOCAL 
      The words are inserted only into the dictionaries of the named fields. 
   GLOBAL 
      The words are inserted only into the global dictionary of the database. 
   BOTH 
      The words are inserted into both the LOCAL and the GLOBAL dictionaries. 


You may want to play with the document spec in "test.fmt". See the example in dir
./FIELD-EXAMPLE. Just type "make" to index and "make test" to call "swais". 

Features of this version

SEARCHES 
   restricted to a field are now possible. 
DOCUMENT TYPES 
   can now be defined with a special language. No code modification is necessary
   to add new document types. Fields are defined by regexps. For each
   input field the type of index (text, soundex, phonix) can be specified. 
BOOLEAN SEARCHES 
   really work now. The flip side is that you can now get 'SYNTAX ERRORS'. 
LITERAL and PARTIAL SEARCHES 
   really work now. 
NEW TERM WEIGHTING 
STEMMING 
   can be switched on for individual fields. 

How can you use this version?

Call waisindex with the option '-t fields'. The document specification should be in
"<database>.fmt". 

1) Define a document specification named <database>.fmt. 

2) Run waisindex -d index_filename -t fields filename 

NOTE

   Of course, you can use other options too, e.g. waisindex -d index_filename -t
   fields -r filename. 
   If you want to add a single new field without deleting the old fields,
   you can use the option -nfields. In the document specification you must add
   the new fields which you want to index. 

      Example:
      -------- ........ 
               ........
               ........ above
               .          
               <field> /^AU: /
               names TEXT LOCAL | BOTH
               <end> /^[A-Z][A-Z]:/


   Only the field 'names' is created. 
   If you want the headline to follow a predefined format (e.g. irlist,
   mail_or_rmail, etc.) instead of the standard field format for headlines,
   call: waisindex -d test -t fields -t mail_or_rmail TEST
   (-t mail_or_rmail must come after -t fields!) 



How can you make a search query?

QUERY SYNTAX

      query          -> expression
      expression     -> term
                        expression OR term
                        expression term                   # OR may be omitted
      term           -> factor
                        term AND factor
                        term NOT factor                   # NOT really means
                                                          # AND NOT
      factor         -> word
                        (  expression )
                        field   =  ( s_expression )
                        field   =  word
                        field   = phonix|soundex word     # phonix or soundex search
                        field   == word                   # for numeric fields
                        field   <  word     
                        field   >  word

      # same as above, but no field spec is allowed, since one is
      # given already
      s_expression   -> s_term
                        s_expression OR s_term
                        s_expression s_term
      s_term         -> s_factor
                        s_term AND s_factor
                        s_term NOT s_factor
      s_factor       -> WORD
                        ( s_expression )


QUERY EXAMPLES

      "information retrieval"                             # free text queries
      "information OR retrieval"                          # same as above
      "ti=information retrieval"                          # information must be in
                                                          # the title
      "ti=(information retrieval)"                        # one of them in title
      "ti=(information AND retrieval)"                    # both of them in title
      "ti=(information NOT retrieval)"                    # "information" in title
                                                          # and "retrieval" not in
                                                          # title
      "py==1990"                                          # numeric equality
      "py<1990"
      "py>1990"

      "au=(soundex salatan)"                              # soundex search
                                                          # matches e.g.
                                                          # 'Salton'
      ti=('information retrieval')                        # literal search
      ti=(information system*)                            # partial search


For more examples see XwaisqHELP! 


/*********************       TERM WEIGHTING        ************************
 * Documents are represented by term vectors of the form
 *       D = (t_0,w_d0; t_1,w_d1; ...; t_t,w_dt)
 * where each t_k identifies a content term assigned to some sample 
 * document and w_dk represents the weight of term t_k in document D
 * (or query Q). Thus, a typical query Q might be formulated as
 *       Q = (q_0,w_q0; q_1,w_q1; ...; q_t,w_qt)
 * where q_k once again represents a term assigned to query Q.
 * The weights may vary continuously between 0 and 1, the
 * higher weight assignments near 1 being used for the most important terms,
 * whereas lower weights near 0 would characterize the less important terms.
 * Given the vector representation, a query-document similarity value may
 * be obtained by comparing the corresponding vectors, using for example
 * the conventional vector product formula
 *       similarity(Q,D) = sum(w_qk * w_dk), k=1 to t.
 * 
 * Three factors are important for term weighting:
 * 1) term frequency in the individual document (recall)
 * 2) inverse document frequency (precision)
 * 3) document length (vector length)
 * 
 * Term frequency component used here:  new_wgt = 0.5 + 0.5 * tf / max_tf
 * augmented normalized term frequency (tf factor normalized by maximum tf
 * in the vector, and further normalized to lie between 0.5 and 1.0).
 *
 * Collection frequency component used here: 1.0
 * no change in weight; use original term frequency component.
 *
 * Normalization component used here: sqrt(sum(new_wgt^2)) = vector length.
 *
 * Thus, the document term weight is: w_dk = new_wgt / vector length
 *
 * For query term weighting it is assumed that tf equals 1, so that
 * w_qk = 1.
 *
 ****************************************************************************/
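
The weighting described in the comment above can be worked through in a few
lines of Python. This is a sketch of the formulas only, not the code in
waisindex: document_weights and similarity are hypothetical names, and the
document is assumed to be a non-empty list of already tokenized terms.

```python
import math
from collections import Counter


def document_weights(terms):
    """Return w_dk for every term of one document."""
    tf = Counter(terms)
    max_tf = max(tf.values())
    # Augmented normalized term frequency, between 0.5 and 1.0.
    new_wgt = {t: 0.5 + 0.5 * n / max_tf for t, n in tf.items()}
    # Vector length used as the normalization component.
    length = math.sqrt(sum(w * w for w in new_wgt.values()))
    return {t: w / length for t, w in new_wgt.items()}


def similarity(query_terms, doc_weights):
    """Vector product with w_qk = 1 for every distinct query term."""
    return sum(doc_weights.get(t, 0.0) for t in set(query_terms))


doc = ["information", "retrieval", "information", "system"]
weights = document_weights(doc)
score = similarity(["information", "retrieval"], weights)
```

In this example "information" occurs twice and gets new_wgt = 1.0, while the
other two terms get 0.75; after division by the vector length the squared
weights of the document sum to 1.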

Document term weighting by standard Boolean formulations

Given the queries "A or B", "A and B", and "A not B" (A and-not B), and a document X
with weights d_A(X) and d_B(X) for terms A and B, the retrieval values are as follows: 

   d_A(X) + d_B(X) for query (A or B) 
   min(d_A(X), d_B(X)) for query (A and B) 
   min(d_A(X), 1 - d_B(X)) for query (A not B (A and-not B)) 
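
The three retrieval values can be written as one-line Python functions. This
is a sketch that reads the first argument of the and-not case as the weight
d_A(X); the function names are hypothetical and the weights 0.6 and 0.3 are
made-up sample values.

```python
def score_or(d_a, d_b):
    """Retrieval value for (A or B)."""
    return d_a + d_b


def score_and(d_a, d_b):
    """Retrieval value for (A and B)."""
    return min(d_a, d_b)


def score_not(d_a, d_b):
    """Retrieval value for (A not B), i.e. A and-not B."""
    return min(d_a, 1.0 - d_b)


d_a, d_b = 0.6, 0.3  # sample weights of terms A and B in document X
scores = (score_or(d_a, d_b), score_and(d_a, d_b), score_not(d_a, d_b))
```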

Note: If you use these new formulas, the inverted files have a new structure. 


Term weighting in wais

w_dk = ((log(tf) + 10) * idf) / number_of_terms_in_a_document 

   tf = term frequency. Initially tf = 5. 
   idf = 1/term_frequency_in_the_collection 

Disadvantages

   Consider, for example, a database of 10 documents. A term which occurs 10
   times in a single document has idf = 1/10. A term which occurs once in each of
   the 10 documents also has idf = 1/10. In both cases the term would be treated as
   equally relevant, which is not correct. 
   The normalization factor is not based on the weights of the terms in the
   document but only on the number of terms in the document. 
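
The first disadvantage can be checked with a toy collection. The Python sketch
below uses hypothetical helper names and made-up documents; it only
illustrates that idf = 1/term_frequency_in_the_collection cannot tell the two
cases apart, whereas a count of the documents containing the term can.

```python
def collection_frequency(docs, term):
    """Total number of occurrences of `term` in the whole collection."""
    return sum(doc.count(term) for doc in docs)


def old_idf(docs, term):
    """idf as defined above: 1 / term frequency in the collection."""
    return 1.0 / collection_frequency(docs, term)


def document_frequency(docs, term):
    """Number of documents that contain `term` at least once."""
    return sum(1 for doc in docs if term in doc)


# Case 1: the term occurs 10 times in one of 10 documents.
concentrated = [["wais"] * 10] + [["other"] for _ in range(9)]
# Case 2: the term occurs once in each of 10 documents.
spread = [["wais"] for _ in range(10)]

same_idf = old_idf(concentrated, "wais") == old_idf(spread, "wais")
df_pair = (document_frequency(concentrated, "wais"),
           document_frequency(spread, "wais"))
```

Both collections give idf = 1/10, although the term appears in only one
document in the first case and in all ten in the second.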


Note: The changes are restricted to waisindex and waisserver. 

WHERE TO GET

freeWAIS-0.2-sf is here. 

INSTALLATION

Just run the configure script in the Distribution. Then type make install. 

Ulrich Pfeifer, Tue Jun 14 14:09 
