Subj : Re: Generating an Index
To   : comp.programming
From : mensanator@aol.com
Date : Thu Sep 08 2005 08:16 pm

Mary wrote:
> Hi Mensanator,
>
> Thank you for your valuable post.
>
> To be honest, although I understood what you were trying to say, I had
> difficulty in following your example on how you calculated your numbers
> from the array you supplied. Also, I'm not sure how you came by the
> numbers in the array. That said, if I could get to grips with it, I
> think I could use it to help solve my problem.
>
> Would it be possible (I'd really appreciate it) if you could show me
> how to calculate the sum of, say, the following two codes?
>
> 0000011111
> 1100110011

Hamming Distance is the "cost" of changing the first string into the
second. If the strings represented positions of toggle switches, you
would have to flip 5 switches to convert the first string into the
second.

Now, you could write a loop that walks through the strings character by
character, tallying how many have to change, to get the distance. But I
converted your strings into binary numbers: the first string can be
interpreted as the base-2 text representation of 31. This lets me work
with the data mathematically, using math libraries that come with a lot
of binary utility functions.

The library I use is GMPY, the Python port of the GNU Multiple Precision
library for C. It can convert strings into unlimited-precision integers
(so 255-character strings are no problem) and has functions such as
popcount (which counts the number of 1 bits in a binary number) and
Hamming Distance.
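For completeness, here's roughly what that character-by-character loop
might look like (a minimal sketch in modern Python 3 syntax; the
function name is my own choice):

```python
def hamming(s, t):
    """Count the positions where two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("strings must be the same length")
    # tally one for every position where the characters disagree
    return sum(1 for a, b in zip(s, t) if a != b)

print(hamming('0000011111', '1100110011'))  # -> 5
```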
To actually answer your question:

>>> import gmpy
>>>
>>> A = '0000011111'
>>> B = '1100110011'
>>> # right now they are just strings,
>>> # so I'll convert them to binary numbers
>>> a = gmpy.mpz(A,2)  # converts base-2 text to integer
>>> b = gmpy.mpz(B,2)
>>> # note that these are the decimal equivalents
>>> print a,b
31 819
>>>
>>> # the two numbers should have popcounts of
>>> # 5 and 6 respectively
>>> print gmpy.popcount(a),gmpy.popcount(b)
5 6
>>> # and now that they are binary numbers,
>>> # we can ask for the Hamming distance
>>> print gmpy.hamdist(a,b)
5

Assuming you have a way of getting the Hamming Distance, here's how I
used it. The array shows every possible comparison. Again, that's not
practical for n=100000, but I'm trying to illustrate the greedy
algorithm, and I have an example where we don't actually have to create
the n by n array.

Here's a simple example. How many ways can we roll a pair of dice? I can
make an array showing one die as the column heading and the other as the
row heading:

      1   2   3   4   5   6
   ------------------------
 1|   2   3   4   5   6   7
 2|   3   4   5   6   7   8
 3|   4   5   6   7   8   9
 4|   5   6   7   8   9  10
 5|   6   7   8   9  10  11
 6|   7   8   9  10  11  12

This illustrates dice probabilities quite nicely. There are 36 values in
the array, of which six are 7, so the probability of rolling a 7 is 6/36
or 1/6.

But I could just as easily have taken the DIFFERENCE between the dice:

      1   2   3   4   5   6
   ------------------------
 1|   0   1   2   3   4   5
 2|   1   0   1   2   3   4
 3|   2   1   0   1   2   3
 4|   3   2   1   0   1   2
 5|   4   3   2   1   0   1
 6|   5   4   3   2   1   0

Note the diagonal of 0's, and how the values are the same on both sides
of the diagonal. This is similar to what we get when we compare every
record in a table to every other record in the same table. The row and
column indexes both represent the record index. Here the diagonal is
comparing a record to itself, so it will always be 0.

For your test numbers, my array indexes start at 0 and are shown in
square brackets.
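Incidentally, if gmpy isn't available, Python's built-in int type can do
the same three steps (a sketch; `bin(...).count('1')` plays the role of
popcount, and XOR-then-popcount gives the Hamming distance):

```python
A = '0000011111'
B = '1100110011'

# base-2 text -> unlimited-precision integers
a = int(A, 2)   # 31
b = int(B, 2)   # 819

# popcount: count the 1 bits in each number
print(bin(a).count('1'), bin(b).count('1'))  # 5 6

# Hamming distance = popcount of the XOR
# (XOR leaves a 1 bit exactly where the two numbers differ)
print(bin(a ^ b).count('1'))  # 5
```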
The values are then the Hamming Distance between any two records:

      [ 0] [ 1] [ 2] [ 3] [ 4] [ 5] [ 6] [ 7] [ 8] [ 9]
[ 0]    0    1   11    7    7    4    7    8    8   12
[ 1]    1    0   10    8    6    3    6    7    7   11
[ 2]   11   10    0    6    6    9    6    5    5    1
[ 3]    7    8    6    0    4    5    4    3    3    5
[ 4]    7    6    6    4    0    5    4    3    1    5
[ 5]    4    3    9    5    5    0    3    4    4    8
[ 6]    7    6    6    4    4    3    0    1    3    5
[ 7]    8    7    5    3    3    4    1    0    2    4
[ 8]    8    7    5    3    1    4    3    2    0    4
[ 9]   12   11    1    5    5    8    5    4    4    0

We will, of course, not consider any of the 0's in this example, since
they all represent the distance from a string to itself. (0's will show
up, though, when your tables have duplicates.) Also, we need only
consider the distances on one side of the diagonal.

Now there are more numbers to choose from than we actually need. For n
records, we need n-1 distances (assuming we don't count the distance
from the last record back to the first).

What a greedy algorithm does is always choose the smallest possible
distance for each pair of records, the idea being that the aggregate
total will be minimized. That actually works for a minimal spanning
tree, but we are restricted here, and as others have pointed out, a
guaranteed minimum can't be had. Still, we might be able to improve the
situation.

Not counting the identity pairings, the smallest distance is 1, and
there are several such pairings:

[0]-[1]
[6]-[7]
[4]-[8]
[2]-[9]

We could start with any one of them. But since we probably won't be able
to build the entire array for large n, the following sequential greedy
algorithm will work without needing 10 billion numbers (although it will
still need to do 5 billion comparisons):

1) add record [0]

2) add the record with the smallest Hamming Distance to record [0]
   - this is record [1], so the table now looks like

   [0]
   [1]

3) add the record (from those remaining) with the smallest Hamming
   Distance to record [1] (from the array, this is record [5], having
   distance 3). The table now looks like

   [0]
   [1]
   [5]

4) repeat step 3 until all records are placed.

Note the key phrase "from those remaining" in step 3.
Once we place records [0] and [1], they are no longer considered;
otherwise we would go round in circles.

Example: with records [0] [1] [5] placed, we check row [5] of the array
(with columns [0] [1] [5] excluded):

[ 5]    x    x    9    5    5    x    3    4    4    8

The smallest distance available is 3, at [6], so [6] becomes the next
record. Then we look at row [6] with [0] [1] [5] [6] excluded:

[ 6]    x    x    6    4    4    x    x    1    3    5

And now the smallest distance is 1 again, at [7]. Note that we
originally considered [6]-[7] as a starting position, and even though we
didn't start there, we still got the [6]-[7] pairing. If you look ahead
to the final solution, you'll see we got all the distance-1 pairings.
That may not always happen, but a greedy algorithm gives you a good shot
at it.

Every time we place a record, there is one less to consider in the next
iteration, so for n records we end up doing SumOfIntegers(n)
comparisons, which is (n**2 + n)/2.

Here's my actual algorithm in practice. Your original sample table had
these Hamming Distances between adjacent records. In this layout, the
distance shown is from the record to the previous record. Record [0] has
no previous record, so a hyphen is shown. The 1 following [1] is the
distance back to [0].

[0]  -
[1]  1
[2] 10
[3]  6
[4]  4
[5]  5
[6]  3
[7]  1
[8]  2
[9]  4

Original Total Distance: 36

The Total Distance is how many switches I would have to toggle to get
from [0] to [9]. Applying the sequential greedy algorithm yields

[0]  -
[1]  1
[5]  3
[6]  3
[7]  1
[8]  2
[4]  1
[3]  4
[9]  5
[2]  1

New Total Distance: 21

Our total is less (if that means anything to you), and the worst-case
distance is now 5 instead of 10. I don't know if that's the best that
can be done. The non-sequential algorithm I did manually also got 21,
with a slightly different sequence. In tests I ran with small sets of
random numbers, the sequential greedy algorithm always reduced the total
distance.

If you want to see the actual Python code, I can post that.
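In the meantime, the sequential greedy algorithm described above could
be sketched roughly like this (my own sketch, not the code I actually
ran; pure Python, no gmpy, and the helper names are mine). It does the
row-scan-with-exclusions implicitly, by taking the minimum over the
not-yet-placed records:

```python
def hamdist(s, t):
    # Hamming distance between two equal-length bit strings
    return sum(1 for a, b in zip(s, t) if a != b)

def greedy_order(records):
    """Sequential greedy: start at record 0, then repeatedly append
    whichever remaining record is closest to the last one placed."""
    order = [0]
    remaining = set(range(1, len(records)))
    while remaining:
        last = records[order[-1]]
        # "from those remaining": placed records are never reconsidered
        nearest = min(remaining, key=lambda i: hamdist(records[i], last))
        order.append(nearest)
        remaining.remove(nearest)
    return order

def total_distance(records, order):
    # sum of adjacent-record distances, as in the tables above
    return sum(hamdist(records[order[i]], records[order[i + 1]])
               for i in range(len(order) - 1))
```

For example, with four hypothetical strings `['0000', '0001', '0011',
'1111']`, `greedy_order` returns `[0, 1, 2, 3]` with a total distance of
4. Note it still performs on the order of n**2/2 distance calculations,
as discussed above.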
> I think that your final comment on using a reference pattern has value
> and is probably the way I'll go...as soon as I understand how you come
> to your figures.

I don't know if that would help. With a simple sequential greedy
algorithm it may not make any difference. Where it _might_ help is that
you could do a mere 100000 Hamming Distance calculations against a
reference value, instead of the 5 billion comparisons used by the
sequential greedy algorithm. But I don't know if that would end up
improving anything.

> I have been researching "Hamming Distances" and "Greedy Algorithms" and
> although I am gaining some understanding, I definitely can't say that
> I'm 100% comfortable with them yet.
>
> Thank you, and I look forward to your follow-up post.
>
> MJ
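P.S. For what it's worth, the reference-pattern idea could be sketched
like this (my own illustration, not tested against your data; the
all-zeros reference string is just an assumed placeholder, and any real
choice of reference would matter a great deal):

```python
def hamdist(s, t):
    # Hamming distance between two equal-length bit strings
    return sum(1 for a, b in zip(s, t) if a != b)

def order_by_reference(records, ref=None):
    """Sort record indexes by Hamming distance to one reference pattern.
    Needs only len(records) distance calculations, not n**2/2."""
    if ref is None:
        ref = '0' * len(records[0])  # hypothetical default reference
    return sorted(range(len(records)),
                  key=lambda i: hamdist(records[i], ref))
```

The hope would be that records close to the same reference are also
close to each other, but as I said above, I don't know whether that
actually reduces the total distance in practice.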