The INDEXER Program

The Indexing Problem

An index is a table that lists the important topics of a book in
alphabetical order, showing for each the numbers of the pages on
which that topic is mentioned.	An index is an indispensable
part of any non-fiction book.  Even a poor index can give better
access to a book than a table of contents can, while a good
index increases the utility of a book many times.

Indexing a book is a skilled job.  The indexer must understand
the subject of the book well, so as to know what the important
topics are.  The indexer must read the book and comprehend it,
so as to know when a reference to a topic is important enough
to merit an index entry.  The indexer must understand the
intended use, and users, of the book well, so as to know the
terms that they will expect to find in the index.

An index of sorts can be built automatically by sorting the
words of a document file, eliminating duplicates, and appending
the page numbers on which the words appeared.  The indexes of
the Pascal/Z manuals appear to have been made by such a process,
and they demonstrate the failings of the method.  It only
indexes words, it does not index topics.  It works only to the
extent that the topics of the book are fully expressed in single
words (and to the extent that the words were used consistently).
It is completely unselective, showing every reference no matter
how trivial.  There can be no provision for sub-entries under a
major topic, nor for multiple entries for a topic under
different terms.

A good index can be built automatically by some word processing
programs.  The writer embeds indexing commands within the text
of a document.	When the program formats the document for
printing, it writes each embedded index term to a disk file with
the page number on which it appeared.  This method can work
well, since the entries are topical terms chosen by the writer,
who embeds them only at the relevant points.  I have developed
and used such a system for documents formatted by Magic Wand.

Both automatic methods fail if the book is to be typeset.  They
rely on page numbers as they exist in the machine-printed form
of the work.  The page numbers of the typeset work will be
different, so the machine-made index will have to be heavily
edited or completely remade.

And of course, the automatic methods can't work when the book to
be indexed is not available in machine-readable form.  In these
cases, the index must be made by hand.	In the traditional
method, the indexer sets up a file of 4x5 (index!) cards, one
for each topic to be indexed.  Then he/she reads the book and,
wherever a topic appears, writes the page number on its card.
The sorted cards provide the material for typing the index.

The INDEXER program is a machine aid for a human indexer.  It
does the job of the pile of index cards, and it makes the
finished index automatically, as a disk file that can be edited
or printed.


INDEXER's Operations


The INDEXER program works as follows.  When invoked as a CP/M
command, it looks for a file containing a saved index in
internal form.	If there is such a file (one named INDEXER.TRE
on the default drive), the program loads it, and work can
continue from where you left off before.

Once initialized, INDEXER prompts for a "term."  You may type an
index term of up to 64 characters.  Spaces, commas, and indeed
any printable characters are allowed.  You may use the backspace
key to correct typing errors.  The end of the term is signalled
with the Return key.

INDEXER files the term in storage and prompts for a "page."  It
expects the page number of a reference to the term just entered,
an integer from 1 to 32767.  It will accept a negative integer
and store it.  It will also accept a zero, but for reasons of
its own it will store a page number of zero as -32767.	The
program stores any term or subterm only once.  It collects all
the page references for that term and stores them in a list with
the term.

You may go on entering terms and pages as long as you like.
When you want to stop work, enter a term of length zero; that
is, press Return at the "term" prompt.	Then INDEXER writes the
index to disk in two forms.  It writes its collection of terms
and pages in internal form as INDEXER.TRE, so that it will be
able to reload them and pick up work where it left off.  It also
writes a finished index file as INDEXER.OUT.  This file can be
edited with any word processor for final printing.


Using Sub-terms


INDEXER allows any term to have sub-terms and sub-sub-terms to a
depth of nine.	Subterms are very useful in indexes; they allow
you to group subtopics under a main topic entry.  Some readers
will look for a topic under its general term; others will look
for it under a very specific term.  For example, a good index
for the Pascal/Z manual might index the CASE statement two or
even three ways:
	   ...
      CASE statement 27
	 ELSE allowed 4, 69
	 option J 41
	 speed of vs. IF 43
	  ...
      ELSE clause 27
	 with CASE 4, 69
	  ...
      statement types
	 assignment 25
	 BEGIN 32
	 CASE 27
	  ...

To INDEXER, each group of subterms is a little index of its own.
Its terms are stored and sorted, and their page numbers are
collected, just as is done for the main terms.	When it prepares
the output file, INDEXER indents once for each sub-term level.

To enter a subterm, you first enter its main term.  But instead
of ending the main term by pressing Return, you end it by
pressing LineFeed (or control-J if your terminal lacks a
LineFeed key).	INDEXER then prompts for a " . .term," indenting
one level on the screen.  Enter the subterm just as you would a
main term.  End it, too, with LineFeed if you are entering a
sub-subterm.  When you eventually end a term with Return, you
will be prompted for a page number as usual.  That page number
will be associated with the subterm, not with its superior
term(s).


Term Recall


Typing all these terms is tedious, but INDEXER has a feature
which can save a lot of the labor.  The feature is called Term
Recall, and it serves two purposes.

You recall a term by typing some of its initial letters, then
pressing the Escape key.  INDEXER searches its list of terms for
the alphabetically-lowest one that matches the initial letters
you have typed to that point.  It then completes the term on the
screen by typing the remaining letters of that term.  If that is
the term you wanted to recall, you may then press Return or
LineFeed just as if you had typed the whole term yourself.  Or
you can modify the term as it appears then by backspacing and
retyping part of it.

If the term is not the one you want, just press Escape again.
INDEXER will wipe out the letters it supplied, find the next
term in alphabetic order, and show its final letters.  If you
keep pressing Escape, you will be shown all the terms that match
the initial letters you typed.	When there are no more, INDEXER
beeps the console alarm.

If you decide that you don't want any of the recalled terms,
press the Delete key.  INDEXER will restore the line to just the
characters you had typed initially.

Term Recall can save a lot of typing.  It also provides a way to
review the terms you have defined so far.  Press Escape without
typing any initial letters at all; INDEXER will complete your
non-existent entry by showing you the first of all the terms it
has, and will step through all the terms as you press Escape.


How To Index With INDEXER


Start by marking up a copy of the book.  Read through it
carefully with a pencil and a highlighting pen in hand.  Mark
every term to be indexed, and note in the margins where a topic
should be entered under more than one term.

Then, book in lap, start up INDEXER using the INDEXER.SUB submit
file:

      SUBMIT INDEXER bookname

This submit file contains CP/M commands that will keep INDEXER's
two output files as bookname.TRE and bookname.INX, so you can
have index work in progress for several different books at once.

When INDEXER starts up, begin entering terms and pages in the
order they occur in the book. When you have finished the book,
or when you want to stop, just hit Return at the "term" prompt.
INDEXER will write its files, and the submit file will rename
them.  If you are not finished, come back to the index later
with the same submit command; you will be able to pick up where
you left off.

You may print bookname.INX, or edit it with your favorite word
processor to insert print-formatting commands.	One reason for
editing the file is to delete errors.  There is no way to get
INDEXER to forget a term once you have entered it.  If you enter
a term incorrectly, give it a page number of zero.  That will
show up in the output file as page 32767.  You can find such
lines with your editor and delete them.


INDEXER's Limitations


Please do not enter many index terms in alphabetical order.  The
scheme for term storage (a binary tree) assumes that terms will
be entered in approximately random sequence.  If they are all
given in alphabetic order, the program will fail.

INDEXER's storage is large, but not too large.  To put it as a
formula, INDEXER uses 15+(2*termlength) bytes for every unique
term or subterm, and four bytes for every page number.	If terms
average about 20 bytes each, and if there are about two page
numbers per term, INDEXER can handle about 500 terms in a 64KB
machine.  If the program runs out of storage, it will crash with
a run-time error message.  So if your index is approaching 500
terms (about nine pages as printed), save your work often.


Programmer Details


INDEXER stores terms as character strings.  I chose to do my own
strings rather than using the Pascal/Z string type, so that the
program could be ported to other compilers.  Since most terms
are shorter than the 64 bytes allowed, keeping every term as an
independent string would waste storage.  Except when they are
being entered or written, terms are stored compactly in blocks
of 2048 bytes, and referenced by a block number and an offset
index.	I've allowed for up to 16 blocks (32Kb).  The code is
about 18Kb, so what with Pascal stack space and the binary tree
nodes, the full 16 blocks will probably never be allocated.

Terms are stored in a binary tree, and subterms under a term are
stored in a tree dangling from the superior term's node.  Page
references are stored in an ordered chain anchored in the term's
node.  In some cases, the trees are processed with recursive
algorithms, as traditional (see J and W program 11.5).	But more
often, recursion was inconvenient and would have eaten up too
much stack space.  Where a tree is to be "walked" in lexical
(in-) order, it is done by setting up a tree-walk record which
is processed by a "treestep" function.	That figures out the
next node and returns a pointer to it, saving the state of the
walk in the tree-walk record.  I played a lot of games with the
trees, some of which are (I think) quite clever.  The Term
Recall feature is based on tree-walking, and it turned out to be
a really slick user interface.

An index has to be in true alphabetical, not ASCII, order.  The
only way to make "Apple" come after "anteater" is to do all
comparisons in upper case.  To keep the speed up, I stored every
term two ways: as entered, and in all-caps.  The all-caps form
is used for all comparisons; the as-entered form is used for all
output.  This has the side effect that the terms "Apple" and
"apple" are identical, and only the first one entered will
appear in the output.  If storage was critical, it would be
possible to store only the original form of a term, and convert
it to all-caps prior to any comparison.

There is no provision for odd page number systems.  The program
can't handle hyphenated or alphabetical page numbers.  It can be
done, but it requires making a page number into a string, not an
integer.  It also introduces complications in keeping page
numbers in order (getting page "7-2" to collate before page
"7-10") and in eliding sequential page numbers (compressing page
numbers 7-1, 7-2, and 7-3 into a single reference).  Since the
program is designed for use on real books, integers are fine.

At 800+ lines, INDEXER is one big program.  I wouldn't be at all
surprised if it still contained a bug or two, so be alert.  It
compiles, assembles, and links in less than 10 minutes on a 4
MHZ system with double density drives.	I tried applying the
PASOPT program to it, with unimpressive results.  PASOPT read
that immense source file and managed to save 108 bytes of
machine code, or less than one percent.  Wowee.

.