Indexing Numeric Data with Isite

Beginning with Isite release 2, Isearch can index numeric data, dates and geospatial bounding boxes.  
The FGDC doctype contains sample code showing which methods a doctype must provide.  This document 
gives a general idea of how to take advantage of the new data types.

Iindex identifies fields just as it always has, using the ParseFields() method in the doctype.  Once the pointers to all of 
the fields have been located, the indexing process writes out a field coordinate table for each of the fields.

In the case of text fields, this table will contain the starting and ending offsets to the instance of the field being stored.  
In the case of numeric fields, the table will contain a pointer to the start of the field in the document, and the numeric 
value (or values - cf. dates and intervals).
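As a rough illustration of the two kinds of table entry described above, the records could be modelled as follows.  This is only a sketch: the class and attribute names are hypothetical, and the actual on-disk layout used by Iindex is not described here.

```python
from dataclasses import dataclass


@dataclass
class TextFieldEntry:
    """Hypothetical entry for a text field: where one instance starts and ends."""
    start: int  # offset of the first character of the field instance
    end: int    # offset of the last character of the field instance


@dataclass
class NumericFieldEntry:
    """Hypothetical entry for a numeric field: a pointer plus one value."""
    start: int    # offset of the start of the field in the document
    value: float  # the parsed numeric value


@dataclass
class IntervalFieldEntry:
    """Hypothetical entry for a date or date-range field: a pointer plus two values."""
    start: int    # offset of the start of the field in the document
    begin: float  # first value of the interval
    end: float    # last value of the interval
```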

In order to know which type of field table to write out, the indexer must have access to a table of field types.  In the 
FGDC doctype, this field type file is loaded by the doctype and stored internally.  The field type table is simply a list of 
the non-text fields to be found in the documents, along with their types.  For the FGDC doctype, the field type table 
looks like this:

------------------------------------------------
attrmres=num
begdate=date
begtime=time
bounding=gpoly
caldate=date
denflat=num
eastbc=num
enddate=date
endtime=time
feast=num
fnorth=num
latprjo=num
longcm=num
northbc=num
numdata=num
numstop=num
procdate=date
proctime=time
pubdate=date
pubtime=time
rdommax=num
rdommin=num
rngdates=text
southbc=num
srcscale=num
stdparll=num
time=time
timeinfo=text
timeperd=date-range
westbc=num
----------------------------------------------

The allowed field types are "num", "date", "date-range", "time" and "gpoly".  Note that "time" is not currently 
implemented.  The type "gpoly" is currently implemented for a geospatial bounding rectangle, but the name hints 
that we may extend it to arbitrary polygons someday.  The types "date" and "date-range" are similar, but with slight 
differences, described below.  The sample table also marks two fields as "text"; these are presumably handled as 
ordinary text fields.
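To make the file format concrete, here is a minimal sketch of a loader for a field type file of the "name=type" form shown above.  It is written in Python for brevity and is not the FGDC doctype's LoadFieldTable(); the function name load_field_table and the decision to reject unknown types are assumptions.  The "text" type is accepted because it appears in the sample table.

```python
# Types named in this document, plus "text", which appears in the sample table.
ALLOWED_TYPES = {"num", "date", "date-range", "time", "gpoly", "text"}


def load_field_table(lines):
    """Build a {field name: field type} dictionary from "name=type" lines.

    Blank lines and lines without "=" (such as the dashed rules around the
    sample listing) are skipped.  Raising on an unknown type is an assumption.
    """
    table = {}
    for raw in lines:
        line = raw.strip()
        if not line or "=" not in line:
            continue
        name, _, field_type = line.partition("=")
        name = name.strip()
        field_type = field_type.strip()
        if field_type not in ALLOWED_TYPES:
            raise ValueError("unknown field type %r for field %r" % (field_type, name))
        table[name] = field_type
    return table


def field_type_of(table, name):
    """Fields that are not listed are treated as ordinary text fields."""
    return table.get(name, "text")
```

With the sample table loaded, field_type_of(table, "westbc") would give "num", while a field that is not listed, such as a title, would fall back to "text".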

The FGDC doctype receives the name of the field type file from one of the doctype command line options:

Iindex -d mydata -t fgdc -o fieldtype=fgdc.fields *.sgml

where the field type file is called "fgdc.fields".

Now, back to indexing.  When the method WriteFieldData() in the INDEX class writes out the field tables, it first looks 
up the field type.  If the field is numeric or date, it calls a parsing routine in the doctype to convert the text contents of 
the field to the appropriate numeric values.  Numeric fields (including individual latitude and longitude fields) are 
converted by the doctype method ParseNumeric().  Its job is to take the text string in the field and return the appropriate 
numeric value as a double.  Different doctypes will have different text representations of the values, so you will have to 
write a parser for whatever doctype you're implementing.
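As an illustration of the kind of work ParseNumeric() has to do, here is a minimal sketch in Python.  It is not the FGDC implementation: the function name parse_numeric, the stripping of SGML-style tags, and the choice to return None when no number is found are all assumptions.

```python
import re

# Matches an optionally signed decimal number with an optional exponent.
_NUMBER = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")
# Matches anything that looks like an SGML tag, so digits in tag names are ignored.
_TAG = re.compile(r"<[^>]*>")


def parse_numeric(field_text):
    """Return the first number found in the field text as a float, or None."""
    stripped = _TAG.sub(" ", field_text)
    match = _NUMBER.search(stripped)
    if match is None:
        return None
    return float(match.group(0))
```

A real doctype would also have to decide how to report a field that contains no usable number; returning None here is just one option.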

Date fields are converted by the doctype methods ParseDate() and ParseDateRange().  They take the contents of the 
field and calculate two numeric (double) values - the starting and ending values of a date range interval.  The computed 
values can be of differing precision, but should be of the form YYYY (e.g., 1986), YYYYMM (e.g., January 1997 would 
be converted to 199701) or YYYYMMDD (e.g., 15 April 1984 would be converted to 19840415).  Fractional days 
should presumably be carried in the fractional part of the value.
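The three precisions can be sketched in a few lines of Python.  The function name encode_date is hypothetical, and the sketch does not validate the month or day or handle fractional days.

```python
def encode_date(year, month=None, day=None):
    """Encode a date as YYYY, YYYYMM or YYYYMMDD, returned as a float.

    The precision of the result follows the arguments supplied:
    year only, year and month, or year, month and day.
    """
    if month is None:
        return float(year)
    if day is None:
        return float(year * 100 + month)
    return float(year * 10000 + month * 100 + day)
```

For the examples in the text, encode_date(1986) gives 1986.0, encode_date(1997, 1) gives 199701.0 and encode_date(1984, 4, 15) gives 19840415.0.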

If you examine FGDC::ParseDate() and FGDC::ParseDateRange() you will note some differences.  The buffer passed 
to the routines can be a numeric date, or can be a tagged field - FGDC uses <CALDATE> to tag a single date (for 
example, Publication date is a single date), and uses <BEGDATE> and <ENDDATE> to tag a date interval.  Both date 
and date-range fields store intervals, but differ in how they treat single dates.  If the field is defined as type "date", the 
beginning and ending points of the interval will be the same - the specified date.  If the field is defined as "date-range" 
and the date is not given at full precision (for example, <CALDATE>1996</CALDATE>), then the beginning date is 
extended to the first day of the period (i.e., 19960101) and the ending date is extended to the last day of the period 
(i.e., 19961231), both at full precision.
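The difference between the two types can be sketched as follows.  This is an illustration in Python, not the code in FGDC::ParseDate() or FGDC::ParseDateRange(); the function name date_interval is hypothetical, and the input is assumed to be a bare string of 4, 6 or 8 digits with any tags already removed.

```python
import calendar


def date_interval(digits, field_type):
    """Return (begin, end) as floats for a YYYY, YYYYMM or YYYYMMDD string.

    For "date" the interval is trivial: both ends are the value as given.
    For "date-range" a partial date is widened to full precision.
    """
    if not digits.isdigit() or len(digits) not in (4, 6, 8):
        raise ValueError("expected YYYY, YYYYMM or YYYYMMDD, got %r" % digits)
    if field_type == "date":
        value = float(digits)
        return (value, value)
    if field_type != "date-range":
        raise ValueError("not a date type: %r" % field_type)

    year = int(digits[:4])
    if len(digits) == 4:
        first_month, last_month = 1, 12
    else:
        first_month = last_month = int(digits[4:6])
    if len(digits) == 8:
        first_day = last_day = int(digits[6:8])
    else:
        first_day = 1
        # monthrange() returns (weekday of first day, number of days in month)
        last_day = calendar.monthrange(year, last_month)[1]

    begin = year * 10000 + first_month * 100 + first_day
    end = year * 10000 + last_month * 100 + last_day
    return (float(begin), float(end))
```

With this sketch, "1996" as a "date-range" becomes the interval 19960101 to 19961231, while the same string as a "date" stays at 1996 for both ends.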

This makes the code that executes the search simpler, but it can produce unexpected results - hopefully too many 
matches rather than too few.

There are a couple of special dates, as well.  The current date (that is, the date on which the search is being run) can be 
encoded as the constant DATE_PRESENT.  Errors can be returned as DATE_ERROR and unknown dates can be 
returned as DATE_UNKNOWN.  No negative dates are allowed (for those of you who did your research in previous 
millennia).

If your date field is single valued (that is, not an interval), you can return the same value for the starting and ending 
dates.  The search engine will treat it as a trivial interval.

To review, there are only three steps necessary to handle indexing the new data types.

1. Create the field type file
2. Add methods to the doctype to load the field table, LoadFieldTable(), and to parse the numeric and date data, 
ParseNumeric(), ParseDate() and ParseDateRange().
3. Index the data with Iindex, using the doctype -o command line option to pass the name of the field type file to the 
indexer.

Searching Numeric Data with Isite

There is no command parser for searching numeric data or dates with the command line Isearch or CGI gateway 
Isearch-cgi.  We'd welcome suggestions for a command syntax, but it has to be easy.  Right now, queries on numeric 
data and dates have to be submitted using Z39.50, since that protocol supports the full range of parameters a user might 
need to specify.

Spatial queries can, however, be submitted using Isearch.  See the command line help for the syntax of the -rect 
parameter.  Currently, Isearch returns a record if there is an overlap between the region specified in the user's query and 
the bounding box in the data record.  
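The overlap test can be sketched as below.  This is an illustration in Python, not the Isearch code: the names BoundingBox and boxes_overlap are hypothetical, boxes that only touch along an edge are counted as overlapping, and boxes that cross the 180 degree meridian are not handled.

```python
from typing import NamedTuple


class BoundingBox(NamedTuple):
    """A latitude/longitude rectangle in decimal degrees (hypothetical type)."""
    north: float
    south: float
    east: float
    west: float


def boxes_overlap(a, b):
    """Return True if the two rectangles share any area or edge.

    Assumes north >= south and east >= west for both boxes.
    """
    lat_overlap = a.south <= b.north and b.south <= a.north
    lon_overlap = a.west <= b.east and b.west <= a.east
    return lat_overlap and lon_overlap
```

A record would then be returned whenever boxes_overlap(query_box, record_box) is true, where the two boxes come from the query and from the bounding fields of the record.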

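Some explanation is required to understand the way the search engine matches dates.  In the zclient examples below, 
the numbers in square brackets are the Z39.50 attribute type and value pairs attached to the search term.

Examples: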

zclient localhost 6668 test 199601[1,31,2,14,4,5]

zclient localhost 6668 test 199601[1,31,2,16,4,5]

zclient localhost 6668 test 199601[1,31,2,18,4,5]

zclient localhost 6668 test "19960101 19961004[1,31,2,16,4,115,5,100]"

zclient localhost 6668 test "90 -90 180 -180[1,3111]"
