Subj : Re: Generating an Index To : comp.programming From : Wavemaker Date : Thu Sep 08 2005 01:25 am "Mary" wrote: > Hi, I need some assistance with a query. To be honest, I'm not even > sure it can be done. > > I'll try to keep the information limited to only what's relevant to > what I have and what I am trying to achieve with this query. > > I have a table that contains around 100,000 records. For the sake of > this discussion, assume just two columns: > > ID Data > 1 000000000000 > 2 010000000000 > 3 011111111111 > 4 101100011101 > 5 110100011011 > 6 111100000000 > 7 111100000111 > 8 111100010111 > 9 111100011011 > 10 111111111111 > > > > The Data column contains only 1's and 0's. > The Data column is a text column, not numeric. > The Data column is actually 255 chars long. (I limited it above to 12 > for this example only) > Duplicates on the Data column are allowed and do exist. > > > With 100,000 records, you would note that in the above example record > 10 would actually be record 100,000. > > > My aim is to somehow sort the data (by creating a third column) so > that the records are in order of "string distance" or similar. In > other words, so that similar strings are located next to (or as close > as possible to) each other. > > > For example, taking the data above: > > The "distance" between record 8 and record 9 is 2 (ie Only two > positions in the Data are not the same) > The "distance" between record 3 and record 4 is 6 > > However, the "distance" between record 3 and record 10 is only 1, > but sorted in the normal fashion, there would be some 99,990 records > between them. > > > I was hoping that as I add records to this table I could calculate a > number or a code or something to create a third column that could be > indexed. Accessing records using this index would give me the records > in order of "String Distance" or similar. I think there is a problem here, or maybe I've misunderstood something. For example, here are two data items: 1 000000000000 2 010000000000 There is a distance of 1 (i.e. only one bit differs). Okay, say we have a third item: 1 000000000000 2 010000000000 3 001000000000 So far, so good. The distance between the first and second item is 1 as is the distance between the first and third item. The distance between the second and third item is two. The items are sorted. But here's where I run into a problem. Say there is a fourth item: 1 000000000000 2 010000000000 3 001000000000 4 011000000000 The third and fourth item have a distance of 1. The distance between the second and fourth item is also 1. However, the distance between the second and third item is two. We need to resort, but how? The fourth item belongs in the third position like so: 1 000000000000 2 010000000000 3 011000000000 4 001000000000 So now the distance between second and third item (formerly the fourth item) is 1 as is the distance between the third and fourth item. However, this doesn't work because the distance between the first and third item is 2 but the distance between the first and fourth item is 1, so while items two through three are sorted, the overall table is not. Is this making sense? It's quite possible I'm missing something, so hopefully a further explanation will help. .