Subj : Re: Generating an Index To : comp.programming From : Patricia Shanahan Date : Fri Sep 09 2005 12:53 pm Wavemaker wrote: .... > This problem interested me enough to write a small program to try out > various approaches. One thing I noticed is that sorting by distance to a > reference value brought duplicate values closer together, which is what > you would expect. Actually, since my data set was small, it always > brought them right next to each other, but in a larger data set I don't > see that as guaranteed. Simply sorting on the bit pattern as a text field gets guaranteed duplicate adjacency. Given the sparseness of the numbers, and the difficulty of the index-and-sort approach, how about an entirely different data structure? For example, how about a graph in which each entry has links to its immediate neighbors? While one entry could have 255 links, the average might be a lot less, depending on the data. A breadth-first search of the graph starting from any vertex would find that vertex's distance 1 neighbors first, then distance 2, etc. Use one graph vertex for each group of duplicates, but have them adjacent in the file so that when you find one you have found them all. Adding a node would require a scan to find all its distance 1 neighbors, so performance depends on the relative frequency of insert and search. In terms of the hypercube view, the graph vertices are vertices of the hypercube for which there is at least one entry, and graph edges are those edges of the hypercube for which each vertex has at least one entry. Patricia .