Subj : Re: Generating an Index
To   : comp.programming
From : Wavemaker
Date : Fri Sep 09 2005 03:30 am


"Willem" wrote:
> Wavemaker wrote:
> ) Hmm, I've thought about this and was about to suggest using
> ) 10101010... as a reference value and calculating the distance of
> ) each item in reference to that value.
>
> There is no reason whatsoever why 10101010... would be a better
> reference value than 00000000... would be.  To see this, you have to
> realise that XORing all numbers with a given pattern will not change
> the relative distances of any two numbers.

You're right. I think I chose that value because it intuitively seemed
better, without having a concrete reason. Assuming this approach has
merit, which I'm not at all sure of, any value would do as a
reference value. Correct?
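
Here is a minimal sketch of Willem's point, assuming "distance" means
Hamming distance (the number of differing bits) and that items fit in
plain integers. The names hamming and check_xor_invariance are my own,
made up for illustration. Since (a ^ p) ^ (b ^ p) equals a ^ b, XORing
everything with a pattern leaves every pairwise distance unchanged:

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def check_xor_invariance(values, pattern):
    """True if XORing every value with pattern keeps all pairwise distances."""
    for a in values:
        for b in values:
            if hamming(a, b) != hamming(a ^ pattern, b ^ pattern):
                return False
    return True


values = [0b01111111, 0b11111111, 0b11110000, 0b00001111]
print(check_xor_invariance(values, 0b10101010))  # True
```

So sorting by distance to 10101010... is the same as XORing every item
with that pattern and then sorting by distance to 00000000....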

The reason I thought using a reference value might help was Mary's
example of the two values, 01111111 and 11111111 (give or take a few
bits). They have a distance of 1. Assuming those two items are located
thousands of items apart, sorting them by distance to a reference value
brings them closer together: with a reference value of 00000000 their
distances are 7 and 8, so they end up in neighboring groups.

But as you've pointed out, moving values around like this causes some
values that should be far apart to end up close together. If you take
the values 11110000 and 00001111 and sort them by distance to a
reference value of 00000000, there is a problem: both are at distance 4
from the reference value, yet they are at the maximum distance, 8, from
each other. According to the sort, though, they belong side by side.
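
A quick check of that example, again assuming Hamming distance and
using a hamming helper of my own naming. The underlying reason is the
triangle inequality: the difference between two sort keys is only a
lower bound on the real distance between the items, so equal keys say
nothing about how far apart the items actually are.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


REFERENCE = 0b00000000
a, b = 0b11110000, 0b00001111

key_a = hamming(a, REFERENCE)  # sort key of a
key_b = hamming(b, REFERENCE)  # sort key of b
apart = hamming(a, b)          # true distance between a and b

print(key_a, key_b, apart)  # 4 4 8
```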

Whew, I'm back to thinking that this problem may not be solvable. At
least not by me.

This problem interested me enough to write a small program to try out
various approaches. One thing I noticed is that sorting by distance to a
reference value brought duplicate values closer together, which is what
you would expect. Actually, since my data set was small, it always
brought them right next to each other, but in a larger data set I don't
see that as guaranteed.
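
This is not my test program, just a small sketch of why adjacency is
not guaranteed, assuming Hamming distance, a reference value of 0, and
a stable sort (Python's sorted is stable). Duplicates always get the
same key, but so can unrelated values, and those can land in between:

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def sort_by_reference(items, reference=0):
    """Sort items by Hamming distance to the reference value."""
    return sorted(items, key=lambda v: hamming(v, reference))


# Two copies of 0011 with 1100 between them; all three have key 2.
data = [0b0011, 0b1100, 0b0011]

by_distance = sort_by_reference(data)
# Hypothetical tie-break: use the value itself as a secondary key.
by_distance_then_value = sorted(data, key=lambda v: (hamming(v, 0), v))

print([format(v, "04b") for v in by_distance])
# ['0011', '1100', '0011']
print([format(v, "04b") for v in by_distance_then_value])
# ['0011', '0011', '1100']
```

Breaking ties on the value itself does put duplicates next to each
other, but only duplicates; it does nothing for items that are merely
close.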

Duplicate items would have a distance of 0 between them, of course. If
there are a lot of duplicates, and the goal is to take a particular
item in the data set and find the item closest to it, then the closest
item is a duplicate at distance 0, and this approach may have merit.
Maybe. Mary will have to explore this further to see if this approach
is actually useful.
