Subj : Re: Generating an Index
To   : comp.programming
From : Wavemaker
Date : Thu Sep 08 2005 07:00 pm


"Mary" wrote:
> Hi Wavemaker,
>
> Yes, you are correct in your understanding.
>
> I was hoping that I could create a third column in the database that
> contained a "weighting" factor (or something like that) based on the
> "Data" column, so that when I sorted the table on this new column I
> would find records where the distance between them is lowest. Note 
> that
> it's not as simple as counting up the 1's...eg:
>
> 101010101010 and
> 010101010101
> Both have the same number of 1's, but the distance is 12, ie 100%
> different.
>
> Taking your data from your post, where you have the 4 records, it 
> would
> be reasonable (and I would expect) that there would be several records
> in a row where they differ from each other by a common amount...
>
> 1 000000000000
> 2 010000000000
> 3 001000000000
>
> Each has a distance of 1 from each other.
> Consider that in my current situation, with 100,000 records:
>
> 1  000000000000
> .
> .
> 55   011111111111
> .
> .
> 100,000   111111111111
>
>
> As you can see, record 55 has a distance of 1 when compared to record
> 100,000. By not having any sort of "weighting factor", it would take
> ages for me to find this match.
> All I want to do is reduce the number of "reads" I have to do to find
> these matches...if I could just bring record 100,000 a bit closer by
> sorting on a third column. Even if I have to read 50-100 records to
> find it would be much better than having to read over 90,000

Hmm, I've thought about this and was about to suggest using 10101010... 
as a reference value and calculating the distance of each item in 
reference to that value. The results would be stored in the third 
column, which you would sort on. But I see that this has already been 
suggested (with the addition of using the Hamming Distance algorithm, 
which I don't know anything about).

Other than this, I'm at a loss. The only other solution I thought of is 
way too inefficient, and that would be to choose an item for your query, 
calculate the distance of all the other items in relation to that item, 
and then sort. If you're running lots of queries, this would not be 
practical, I would think.

In addition to what Willem has said, the problem I think is that the 
distance values are relative, so it's hard to come up with an overall 
solution that will work for all values. Maybe that's why 10101010 would 
be the best reference value in that it would help order the values in 
the most useful way. Sorry I can't be of more help.

.