Subj : Re: Generating an Index
To   : comp.programming
From : mensanator@aol.com
Date : Thu Sep 08 2005 08:16 pm

Mary wrote:
> Hi Mensanator,
>
> Thank you for your valuable post.
>
> To be honest, although I understood what you were trying to say, I had
> difficulty in following your example on how you calculated your numbers
> from the array you supplied. Also, I'm not sure how you came by the
> numbers in the array. That said, if I could get to grips with it, I
> think I could use it to help solve my problem.
>
> Would it be possible (I'd really appreciate it) if you could show me
> how to calculate the sum of, say, the following two codes?
>
> 0000011111
> 1100110011

Hamming Distance is the "cost" of changing the first string into the
second. If the strings represented positions of toggle switches, you
would have to flip 5 switches to convert the first string into the
second.

Now, you could write a loop that walks through the strings character by
character, tallying how many have to change, to get the distance. But I
converted your strings into binary numbers: the first string can be
interpreted as the base-2 text representation of 31. This lets me work
with the data mathematically, using math libraries that come with a lot
of binary utility functions.

The library I use is GMPY, the Python port of the GNU Multiple Precision
library for C. It can convert strings into unlimited-precision integers
(so 255-character strings are no problem) and has functions such as
popcount (which counts the number of 1 bits in a binary number) and
Hamming Distance.
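For completeness, here's roughly what that character-by-character loop
might look like (a minimal sketch in modern Python 3 syntax; the
function name is my own choice):

```python
def hamming(s, t):
    """Count the positions where two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("strings must be the same length")
    # tally one for every position where the characters disagree
    return sum(1 for a, b in zip(s, t) if a != b)

print(hamming('0000011111', '1100110011'))  # -> 5
```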
To actually answer your question:

>>> import gmpy
>>>
>>> A = '0000011111'
>>> B = '1100110011'
>>> # right now they are just strings,
>>> # so I'll convert them to binary numbers
>>> a = gmpy.mpz(A,2)  # converts base-2 text to integer
>>> b = gmpy.mpz(B,2)
>>> # note that these are the decimal equivalents
>>> print a,b
31 819
>>>
>>> # the two numbers should have popcounts of
>>> # 5 and 6 respectively
>>> print gmpy.popcount(a),gmpy.popcount(b)
5 6
>>> # and now that they are binary numbers,
>>> # we can ask for the Hamming distance
>>> print gmpy.hamdist(a,b)
5

Assuming you have a way of getting the Hamming Distance, here's how I
used it. The array shows every possible comparison. Again, that's not
practical for n=100000, but I'm trying to illustrate the greedy
algorithm, and I have an example where we don't actually have to create
the n by n array.

Here's a simple example. How many ways can we roll a pair of dice? I can
make an array showing one die as the column heading and the other as the
row heading:

      1   2   3   4   5   6
   ------------------------
 1|   2   3   4   5   6   7
 2|   3   4   5   6   7   8
 3|   4   5   6   7   8   9
 4|   5   6   7   8   9  10
 5|   6   7   8   9  10  11
 6|   7   8   9  10  11  12

This illustrates dice probabilities quite nicely. There are 36 values in
the array, of which six are 7, so the probability of rolling a 7 is 6/36
or 1/6.

But I could just as easily have taken the DIFFERENCE between the dice:

      1   2   3   4   5   6
   ------------------------
 1|   0   1   2   3   4   5
 2|   1   0   1   2   3   4
 3|   2   1   0   1   2   3
 4|   3   2   1   0   1   2
 5|   4   3   2   1   0   1
 6|   5   4   3   2   1   0

Note the diagonal of 0's, and how the values are the same on both sides
of the diagonal. This is similar to what we get when we compare every
record in a table to every other record in the same table. The row and
column indexes both represent the record index. Here the diagonal is
comparing a record to itself, so it will always be 0.

For your test numbers, my array indexes start at 0 and are shown in
square brackets.
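Incidentally, if gmpy isn't available, Python's built-in int type can do
the same three steps (a sketch; `bin(...).count('1')` plays the role of
popcount, and XOR-then-popcount gives the Hamming distance):

```python
A = '0000011111'
B = '1100110011'

# base-2 text -> unlimited-precision integers
a = int(A, 2)   # 31
b = int(B, 2)   # 819

# popcount: count the 1 bits in each number
print(bin(a).count('1'), bin(b).count('1'))  # 5 6

# Hamming distance = popcount of the XOR
# (XOR leaves a 1 bit exactly where the two numbers differ)
print(bin(a ^ b).count('1'))  # 5
```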
The values are then the Hamming Distance between any two records:

      [ 0] [ 1] [ 2] [ 3] [ 4] [ 5] [ 6] [ 7] [ 8] [ 9]
[ 0]    0    1   11    7    7    4    7    8    8   12
[ 1]    1    0   10    8    6    3    6    7    7   11
[ 2]   11   10    0    6    6    9    6    5    5    1
[ 3]    7    8    6    0    4    5    4    3    3    5
[ 4]    7    6    6    4    0    5    4    3    1    5
[ 5]    4    3    9    5    5    0    3    4    4    8
[ 6]    7    6    6    4    4    3    0    1    3    5
[ 7]    8    7    5    3    3    4    1    0    2    4
[ 8]    8    7    5    3    1    4    3    2    0    4
[ 9]   12   11    1    5    5    8    5    4    4    0

We will, of course, not consider any of the 0's in this example, since
they all represent the distance from a string to itself. (0's will show
up, though, when your tables have duplicates.) Also, we need only
consider the distances on one side of the diagonal.

Now there are more numbers to choose from than we actually need. For n
records, we need n-1 distances (assuming we don't count the distance
from the last record back to the first).

What a greedy algorithm does is always choose the smallest possible
distance for each pair of records, the idea being that the aggregate
total will be minimized. That actually works for a minimal spanning
tree, but we are restricted here, and as others have pointed out, a
guaranteed minimum can't be had. Still, we might be able to improve the
situation.

Not counting the identity pairings, the smallest distance is 1, and
there are several such pairings:

[0]-[1]
[6]-[7]
[4]-[8]
[2]-[9]

We could start with any one of them. But since we probably won't be able
to build the entire array for large n, the following sequential greedy
algorithm will work without needing 10 billion numbers (although it will
still need to do 5 billion comparisons):

1) add record [0]

2) add the record with the smallest Hamming Distance to record [0]
   - this is record [1], so the table now looks like

   [0]
   [1]

3) add the record (from those remaining) with the smallest Hamming
   Distance to record [1] (from the array, this is record [5], having
   distance 3). The table now looks like

   [0]
   [1]
   [5]

4) repeat step 3 until all records are placed.

Note the key phrase "from those remaining" in step 3.
Once we place records [0] and [1], they are no longer considered;
otherwise we would go round in circles.

Example: with records [0] [1] [5] placed, we check row [5] of the array
(with columns [0] [1] [5] excluded):

[ 5]    x    x    9    5    5    x    3    4    4    8

The smallest distance available is 3, at [6], so [6] becomes the next
record. Then we look at row [6] with [0] [1] [5] [6] excluded:

[ 6]    x    x    6    4    4    x    x    1    3    5

And now the smallest distance is 1 again, at [7]. Note that we
originally considered [6]-[7] as a starting position, and even though we
didn't start there, we still got the [6]-[7] pairing. If you look ahead
to the final solution, you'll see we got all the distance-1 pairings.
That may not always happen, but a greedy algorithm gives you a good shot
at it.

Every time we place a record, there is one less to consider in the next
iteration, so for n records we end up doing SumOfIntegers(n)
comparisons, which is (n**2 + n)/2.

Here's my actual algorithm in practice. Your original sample table had
these Hamming Distances between adjacent records. In this layout, the
distance shown is from the record to the previous record. Record [0] has
no previous record, so a hyphen is shown. The 1 following [1] is the
distance back to [0].

[0]  -
[1]  1
[2] 10
[3]  6
[4]  4
[5]  5
[6]  3
[7]  1
[8]  2
[9]  4

Original Total Distance: 36

The Total Distance is how many switches I would have to toggle to get
from [0] to [9]. Applying the sequential greedy algorithm yields

[0]  -
[1]  1
[5]  3
[6]  3
[7]  1
[8]  2
[4]  1
[3]  4
[9]  5
[2]  1

New Total Distance: 21

Our total is less (if that means anything to you), and the worst-case
distance is now 5 instead of 10. I don't know if that's the best that
can be done. The non-sequential algorithm I did manually also got 21,
with a slightly different sequence. In tests I ran with small sets of
random numbers, the sequential greedy algorithm always reduced the total
distance.

If you want to see the actual Python code, I can post that.
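In the meantime, the sequential greedy algorithm described above could
be sketched roughly like this (my own sketch, not the code I actually
ran; pure Python, no gmpy, and the helper names are mine). It does the
row-scan-with-exclusions implicitly, by taking the minimum over the
not-yet-placed records:

```python
def hamdist(s, t):
    # Hamming distance between two equal-length bit strings
    return sum(1 for a, b in zip(s, t) if a != b)

def greedy_order(records):
    """Sequential greedy: start at record 0, then repeatedly append
    whichever remaining record is closest to the last one placed."""
    order = [0]
    remaining = set(range(1, len(records)))
    while remaining:
        last = records[order[-1]]
        # "from those remaining": placed records are never reconsidered
        nearest = min(remaining, key=lambda i: hamdist(records[i], last))
        order.append(nearest)
        remaining.remove(nearest)
    return order

def total_distance(records, order):
    # sum of adjacent-record distances, as in the tables above
    return sum(hamdist(records[order[i]], records[order[i + 1]])
               for i in range(len(order) - 1))
```

For example, with four hypothetical strings `['0000', '0001', '0011',
'1111']`, `greedy_order` returns `[0, 1, 2, 3]` with a total distance of
4. Note it still performs on the order of n**2/2 distance calculations,
as discussed above.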
> I think that your final comment on using a reference pattern has value
> and is probably the way I'll go...as soon as I understand how you come
> to your figures.

I don't know if that would help. With a simple sequential greedy
algorithm it may not make any difference. Where it _might_ help is that
you could do a mere 100000 Hamming Distance calculations against a
reference value, instead of the 5 billion comparisons used by the
sequential greedy algorithm. But I don't know if that would end up
improving anything.

> I have been researching "Hamming Distances" and "Greedy Algorithms" and
> although I am gaining some understanding, I definitely can't say that
> I'm 100% comfortable with them yet.
>
> Thank you, and I look forward to your follow-up post.
>
> MJ
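P.S. For what it's worth, the reference-pattern idea could be sketched
like this (my own illustration, not tested against your data; the
all-zeros reference string is just an assumed placeholder, and any real
choice of reference would matter a great deal):

```python
def hamdist(s, t):
    # Hamming distance between two equal-length bit strings
    return sum(1 for a, b in zip(s, t) if a != b)

def order_by_reference(records, ref=None):
    """Sort record indexes by Hamming distance to one reference pattern.
    Needs only len(records) distance calculations, not n**2/2."""
    if ref is None:
        ref = '0' * len(records[0])  # hypothetical default reference
    return sorted(range(len(records)),
                  key=lambda i: hamdist(records[i], ref))
```

The hope would be that records close to the same reference are also
close to each other, but as I said above, I don't know whether that
actually reduces the total distance in practice.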