Subj : Re: Generating an Index To : comp.programming From : mensanator@aol.com Date : Thu Sep 08 2005 12:43 am Mary wrote: > Hi, I need some assistance with a query. To be honest, I'm not even > sure it can be done. > > I'll try to keep the information limited to only what's relevant to > what I have and what I am trying to achieve with this query. > > I have a table that contains around 100,000 records. For the sake of > this discussion, assume just two columns: > > ID Data > 1 000000000000 > 2 010000000000 > 3 011111111111 > 4 101100011101 > 5 110100011011 > 6 111100000000 > 7 111100000111 > 8 111100010111 > 9 111100011011 > 10 111111111111 > > > > The Data column contains only 1's and 0's. > The Data column is a text column, not numeric. > The Data column is actually 255 chars long. (I limited it above to 12 > for this example only) > Duplicates on the Data column are allowed and do exist. > > > With 100,000 records, you would note that in the above example record > 10 would actually be record 100,000. > > > My aim is to somehow sort the data (by creating a third column) so that > the records are in order of "string distance" or similar. In other > words, so that similar strings are located next to (or as close as > possible to) each other. > > > For example, taking the data above: > > The "distance" between record 8 and record 9 is 2 (ie Only two > positions in the Data are not the same) > The "distance" between record 3 and record 4 is 6 If your data strings were converted to binary numbers, there is a function called Hamming Distance that is exactly what you want: >>> help(gmpy.hamdist) Help on built-in function hamdist: hamdist(...) hamdist(x,y): returns the Hamming distance (number of bit-positions where the bits differ) between x and y. x and y must be mpz, or else get coerced to mpz. > > However, the "distance" between record 3 and record 10 is only 1, > but sorted in the normal fashion, there would be some 99,990 records > between them. For your example numbers, you could build a Hamming Distance array (Note that the distance from a number to itself is always 0) [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [ 0] 0 1 11 7 7 4 7 8 8 12 [ 1] 1 0 10 8 6 3 6 7 7 11 [ 2] 11 10 0 6 6 9 6 5 5 1 [ 3] 7 8 6 0 4 5 4 3 3 5 [ 4] 7 6 6 4 0 5 4 3 1 5 [ 5] 4 3 9 5 5 0 3 4 4 8 [ 6] 7 6 6 4 4 3 0 1 3 5 [ 7] 8 7 5 3 3 4 1 0 2 4 [ 8] 8 7 5 3 1 4 3 2 0 4 [ 9] 12 11 1 5 5 8 5 4 4 0 and then use a greedy algorithm to build up a sequnce that minimizes the Hamming Distance between sequential pairs. Start with any pair that have distance 1: [0] [1] Now (excluding [0]) take the record with the minimum distance to [1]: [0] [1] [5] Repeat until all records placed: [0] [1] [5] [6] [7] [8] [4] [3] [9] [2] Here the sum of all the distances is 21 (it is 35 in the original order). I also tried it another way (by collecting all the distance 1 pairs and then trying to link them together) and got [0] [1] [5] [6] [7] [3] [8] [4] [9] [2] which also sums to 21. I'm not sure if this is the minimum overall distance. A minimum spanning tree would definitely be minimum but is two dimensional. Records in a table are one dimensional. Also note that 0 is closer to [2] (HD=11) than it is to [9] (HD=12). And the Hamming Distance array is impractical with 100000 records. But maybe you could break it up into smaller pieces. I don't know why you need this type of sort but another possibility is distance from a reference pattern, say 01010101010101...0101010101 and then calculate the Hamming Distance from the reference pattern and sort by that. > > > I was hoping that as I add records to this table I could calculate a > number or a code or something to create a third column that could be > indexed. Accessing records using this index would give me the records > in order of "String Distance" or similar. > > I have looked up functions such as "Levenshtein - Edit String > Distance", which is fine when I have to strings to compare. I can't > see it helping in generating an index though. > > I hope I have been clear in my explanation. > > I would really appreciate any comments or discussion that could help me > achieve this. > > Thanks for your time, > > MJ .