Subj : Re: Generating an Index
To   : comp.programming
From : Patricia Shanahan
Date : Fri Sep 09 2005 12:53 pm

Wavemaker wrote:
....
> This problem interested me enough to write a small program to try out
> various approaches. One thing I noticed is that sorting by distance to a
> reference value brought duplicate values closer together, which is what
> you would expect. Actually, since my data set was small, it always
> brought them right next to each other, but in a larger data set I don't
> see that as guaranteed.

Simply sorting on the bit pattern as a text field gets guaranteed
duplicate adjacency.

Given the sparseness of the numbers, and the difficulty of the
index-and-sort approach, how about an entirely different data structure?

For example, how about a graph in which each entry has links to its
immediate neighbors? While one entry could have 255 links, the average
might be a lot less, depending on the data.

A breadth-first search of the graph starting from any vertex would find
that vertex's distance 1 neighbors first, then distance 2, etc. Use one
graph vertex for each group of duplicates, but have them adjacent in the
file so that when you find one you have found them all.

Adding a node would require a scan to find all its distance 1 neighbors,
so performance depends on the relative frequency of insert and search.

In terms of the hypercube view, the graph vertices are vertices of the
hypercube for which there is at least one entry, and graph edges are
those edges of the hypercube for which each vertex has at least one entry.

Patricia

.