VI.d.3 Building Indexes
-----------------------

While many index schemes are possible, a generalized one has been
developed by Bruce Tanner of Cerritos College which may be applicable to
many basic text searching needs. This index scheme is used by the search
engine QUERY.EXE included with this distribution.

BUILD_INDEX.EXE must be run independently of the Gopher Server to build
the .IDX and .SEL files required by the QUERY.EXE engine. Define a
foreign command symbol for BUILD_INDEX, as in:

     $ INDEX :== $device:[directory]BUILD_INDEX.EXE

Then use the program to construct .SEL and .IDX files for a document:

     $ INDEX document /switch /switch ...

where "document" is the (possibly wildcarded) name of the document(s) to
be indexed. This should include the path *as a gopher client* would
specify it, including device:[directory] as required. Wildcarding
permits ellipsis (...) in the directory specification as well as "*" in
the filename. The fully qualified path(s) will appear in all search hit
link tuples as part of the selector strings, so accuracy is important if
subsequent requests for the hit document segments are to succeed. If the
document(s) change often, indexes can be tied to a specific generation
of the document with the /VERSION switch.

Several switches specify the range delimiters to be applied. Range
delimiters divide the document into manageable chunks for reading
purposes, so that a hit on a keyword in that document returns only the
range containing the keyword; if keywords appear in multiple ranges,
multiple ranges are offered to the client. The range delimiter switches
are /CHARACTER, /DASH, /EQUAL, /FIELD, /FF, /LINE, /PARAGRAPH, and
/WHOLE:

     /CHARACTER=%xn - divide the document whenever a line beginning with
                      the character of hexadecimal value n is
                      encountered. "%x" is a required part of the
                      specification.

     /DASH=n        - divide the document whenever a line *begins* with
                      n dashes ("-").
                      Three ("---") is the default.

     /EQUAL=n       - divide the document whenever a line *begins* with
                      n equal signs ("="). Eighty is the default.

     /FIELD=(POSITION=n, SIZE=m)
                    - divide the document whenever columns n through n+m
                      are not all blank.

     /FF            - divide the document whenever a line beginning with
                      a form feed is encountered.

     /LINE          - divide the document into individual lines. Any
                      search hit returns a single line.

     /PARAGRAPH     - divide the document whenever a blank line is
                      encountered.

     /WHOLE         - *DO NOT* divide the document. Any search hit
                      returns the entire document.

/CANDIDATES specifies the file of candidate words to use in the index
file. This is the reverse of /NOISE.

     /CANDIDATES=filename

/DEFAULT_TOPIC specifies whether a default topic name will be selected.

     /DEFAULT_TOPIC   - (the default) use the first line of each new
                        topic as the topic name if no /TOPIC items are
                        matched in the text.

     /NODEFAULT_TOPIC - no default topic name will be read.

If no topic name is selected by the /DEFAULT_TOPIC or /TOPIC switches,
no selector or index entries will be written. In this way, you can omit
header and trailer portions of structured documents.

/HELPFILE specifies a file to match the queries "?" and "?help".

     /HELPFILE=(SELECTOR=str, TITLE=str)

     SELECTOR=string says that the string should be used as the selector
     for locating the file. The string should be formatted as described
     for /SELECTOR=(TEXT=str), below.

     TITLE=string says that the string should be used as the title for
     the file in hit lists returned for queries with "?" or "?help".

/KEYWORD determines what part of the input file will be indexed. By
default (assuming no /NOISE or /CANDIDATES), every word is indexed.

     /KEYWORD=(TEXT=str, EXCLUDE, OFFSET=n, END=str)

     TEXT=string - where to start indexing.

     EXCLUDE     - exclude the contents of the TEXT string from the
                   index.

     OFFSET=n    - says to skip 'n' characters after a keyword match
                   before indexing.

     END=string  - says to stop indexing at "string".
                   If END is not given, indexing will stop at the end of
                   the line. If END is given but not matched, indexing
                   will end at the end of the article.

/LINK specifies whether to create index/selector (.IDX/.SEL) or Gopher
linkage (.LINK) files from the source document.

     /LINK   - create a .LINK file
     /NOLINK - (the default) create .IDX/.SEL files.

By default, /LINK displays selectors in the order found in the source
file. /LINK=SORT displays selectors in sorted order. To create files
with leading periods, use /OUTPUT=.LINK/LINK.

/MAX_TOPICS defines the size of the field that holds the article number.
The default (6) holds 999,999 articles per index.

/MINIMUM_WORD defines the minimum size of a 'word' that will be indexed.
The default (3) automatically eliminates all 1 and 2 character words
from the index.

     /MINIMUM_WORD=n - defines the smallest word to index.

/NOISE specifies the file of 'noise' words to omit from the index file.

     /NOISE=filename

If no file is specified, GOPHER_ROOT:[000000]_NOISE_WORDS.DAT is used.

/NUMBERS specifies whether to index numeric data.

     /NUMBERS   - (the default) says to index numbers.
     /NONUMBERS - says to exclude numbers from the index.

/OUTPUT specifies the name of the .SEL/.IDX/.SEQSEL/.SEQIDX/.LINK files.
By default the file name/directory will be the name/directory of the
first document found. The following switch will override this:

     /OUTPUT=filename - specify the device, directory, filename part of
                        the .IDX, .SEL and .LINK files.

/PUNCTUATION defines which characters of an input line are replaced by
spaces before words are selected from that line for indexing.

     /PUNCTUATION="..." - define the punctuation characters.

The default set of punctuation is .,?:()". Space is always a punctuation
character.

/SELECTOR specifies what kind of selector will be generated for each
topic. If /SELECTOR is not given, a type 'R' (range) selector is
automatically calculated for each topic.
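
As an aside before the /SELECTOR syntax, the word-selection switches
described above (/PUNCTUATION, /MINIMUM_WORD, /NUMBERS, /NOISE) can be
illustrated with a minimal Python sketch. This is not BUILD_INDEX's code;
the function name select_words is hypothetical, and three behaviors are
assumptions rather than documented facts: that "numeric data" means
all-digit words, that noise words are compared without regard to case,
and that the default punctuation set reads as the seven characters
.,?:()" shown above.

```python
def select_words(line, punctuation='.,?:()"', minimum_word=3,
                 numbers=True, noise=frozenset()):
    """Return the words of `line` that would be candidates for indexing.

    Illustrative only: mirrors the documented switch descriptions, not
    the actual BUILD_INDEX implementation.
    """
    # /PUNCTUATION: each punctuation character is replaced by a space.
    table = str.maketrans({ch: " " for ch in punctuation})
    noise_lower = {w.lower() for w in noise}

    selected = []
    # Space is always a separator, so split on whitespace.
    for word in line.translate(table).split():
        # /MINIMUM_WORD=n: drop words shorter than n characters.
        if len(word) < minimum_word:
            continue
        # /NONUMBERS: drop numeric words (assumed: all-digit tokens).
        if not numbers and word.isdigit():
            continue
        # /NOISE: drop noise words (assumed: case-insensitive match).
        if word.lower() in noise_lower:
            continue
        selected.append(word)
    return selected


if __name__ == "__main__":
    print(select_words("The cat (a pet), sat on 42 mats.", noise={"the"}))
```

With the defaults, one and two character words such as "a", "on" and
"42" fall below the minimum word size of 3 and are dropped before the
noise list is ever consulted.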

     /SELECTOR=(TEXT=str, END=str, BOTH, IGNORE=char)

     TEXT=string says that any text following the string (up to the END
     string, next space or end of line) will be used as the selector for
     the current topic. This switch can be used to index textual
     descriptions of binary files, and have links to the binary files
     returned in the search hit lists.

     The format of the selector string is a Type followed by a Path,
     optionally followed by "|" (the vertical bar character), a network
     host name, "|", and a TCP socket number. Type must be a Gopher
     protocol file type (i.e., the value given in a .link 'Type=' entry).
     Path similarly must be a complete Gopher protocol 'Path=' value, and
     must include the Gopher or HTTP DataDirectory device. If there is no
     selector found and BOTH is not requested, no selector or index
     entries will be written.

     For example, a selector for a binary file specified by
     /SELECTOR=(TEXT="selector: ") could be:

          selector: I9gopher_root:[images]picture.gif

     or:

          selector: s9www_root:[sounds]music.au

     A selector for a CCSO phone book (CCSO path fields are blank) which
     uses port 105 on host 'ns.host.com' would be:

          selector: 2|ns.host.com|105

     END=string says what will end a selector string if the default
     rules (end of line or " -->") aren't enough.

     BOTH says to generate selector (and index) entries for both the
     selector specified and the current input file topic.

     IGNORE=character says that any selector beginning with that
     character (presumably, a gopher Type) should cause the entry to be
     excluded from the index. This, for example, can be used to exclude
     material associated with information (Type=i) selectors when
     indexing a walker output file.

/SEQUENTIAL specifies whether indexed (.IDX/.SEL) or sequential
(.SEQIDX/.SEQSEL) index/selector files are created.

     /SEQUENTIAL   - create sequential index/selector files
     /NOSEQUENTIAL - (the default) create indexed files

Omitting the /SEQUENTIAL switch is simpler, but can be much slower than
building the indexed files with CONVERT/FDL after generating the
sequential index/selector files.

/SPECIFICATION specifies a file that may contain the switches defined in
this document. The only exceptions are /SPECIFICATION and the input
file. Specification input lines may be up to 255 characters long, but
must not be continued.

/TOPIC determines how to describe each section of the document. By
default, the first line of each section is used as the description. For
documents that have lines starting with keywords (such as "Subject:"),
you may describe those keywords with the /TOPIC switch:

     /TOPIC=(TEXT=str, EXCLUDE, BREAK, POSITION=n, OFFSET=n, SIZE=n,
             END=str)

Up to 20 /TOPIC switches may be specified for the file(s) to be indexed.
All the text found will be modified by the SIZE and EXCLUDE keywords and
concatenated to form the selector string.

     TEXT=string - record the last line in the section that starts with
                   the specified string. Case is ignored; strings that
                   contain spaces or punctuation should be enclosed in
                   quotes.

     EXCLUDE     - exclude the contents of the TEXT string from the line
                   found to match the TEXT string.

     BREAK       - divide the document whenever this topic is matched.

     SIZE=n      - truncate the line found to match the TEXT string to n
                   characters.

     POSITION=n  - defines (with SIZE) a field to use for the topic name
                   if the field is non-blank.
                 - defines (with TEXT) where to look for the text
                   string. POSITION=0 (the default) says to search the
                   entire line for the text string.

     TEXT or SIZE is required; the defaults are no exclusion and no
     truncation. For example:

     /TOPIC=(TEXT="date: ", SIZE=10, EXCLUDE)
          will use 10 characters of the line that starts with "date: ",
          starting immediately after "date: ".

     /TOPIC=(POSITION=1, SIZE=6)
          will use anything (except blanks) in columns 1 through 6 as
          the topic.

     /TOPIC=(TEXT="date: ", POSITION=1)
          will only look for "date: " in column 1 of each line.

     OFFSET=n    - says to skip 'n' characters after a topic match
                   before extracting a topic title.

     END=string  - says to stop the topic title at "string". Defaults to "