Subj : Re: Locating Nearest Neighbors in space (fast) To : comp.programming From : Randy Date : Thu Sep 08 2005 02:52 pm KRK wrote: > Hi, > > I'm programming in C and have two separate structure arrays holding > coordinates for points in space > > typedef struct{ > double x; > double y; > double z; > } Position; > > Position *array1; /*Approx. 50,000 Entries*/ > Position *array2; /*Approx 50,000 Entries*/ > > My problem is this, I want to take each point in array 1, and find its > 'nearest spatial neighbor' in array 2. This can be done by brute force, but > is rather slow, since to do this requires on the order of 10^10 multiplies. > Further, I want to do this operation abot 180 times! > > Does anyone know a fast algorithm to do this, preferably one that doesn't > require fancy knowledge of data structures etc since I'm not a professional > programmer. Thanks. > > > Karl There may be a standard answer for this, but since I don't know it, here goes... If you use manhattan distance (x2 - x1 + y2 - y1 + z2 - z1) instead of true distance sqrt(x2 - x1)^2 + (y2 - y1)^2 + (z2 - z1)^2), as part of a first pass that decides whether each point is close enough to warrant further consideration, you can avoid the vast majority of multiplies (ans sqrt calls) except when a point falls within the manhattan distance and thus deserves further consideration. As long as your points are not tightly clustered, an approach that trades off fast math for slow math will reduce the runtime significantly. To wit: // min_dist[] is "minimum real distance" (point w/ closest real dist) // min_mdist[] is "minimum manhattan distance" (closest manhattan dist) // min_idx[] is the index of the nearest point // Position p1[50000], p2[50000]; // more concise than array1, array2 for (a = 0; a < 50000; a++) for (b = 0; b < 50000; b++) { mdist = abs(p2.x[b] - p1.x[a]) + abs(p2.y[b] - p1.y[a]) + abs(p2.z[b] - p1.z[a]); if (mdist < min_mdist[a]) { dist = sqrt( (p2.x[b] - p1.x[a])^2 + (p2.y[b] - p1.y[a])^2 + (p2.z[b] - p1.z[a])^2); if (dist < min_dist[a]) { min_dist[a] = dist; min_mdist[a] = mdist; min_idx[a] = b; } } } You can probably speed up the code further by cascading the outer conditional so that the manhattan distance conditional can short circuit as soon as possible, e.g.: // min_mdist is the maximum manhattan distance that excludes only // points that are too far away ... dx = abs( p2.x[b] - p1.x[a]); if (dx < min_mdist) { dy = abs( p2.y[b] - p1.y[a]); if (dy + dx < min_mdist[a]) { dz = abs( p2.z[b] - p1.z[a]); if (dx + dy + dz < min_mdist[a]) { dist = sqrt( dx^2 + dy^2 + dz^2); ... min_mdist = dx + dy + dz; // if you want cluster_size == 1 You may be able to speedup further if you sort your point data arrays along the index which is most sparse (X, Y or Z). Then you can add another outermost conditional to the above code that will short circuit the traversal along the sorted index (here, X) as soon as possible, excluding all remaining points along the sorted index which are too far away to be considered, e.g.: // sort point arrays on X ... for (b = 0; b < 50000; b++) { dx = abs( x2[b] - x1[a]); if (dx > min_mdist[a]) break; // exit to outer loop, and progress to next point a dy = abs( y2[b] - y1[a]); if (dy + dx < min_mdist[a]) ... Also, unless you need to use doubles, use the smaller float. You'll get better cache line reuse with smaller data. Also, turn on the compiler's heavy optimization (in GCC: -O3, -funroll-loops) and specify that you've used no pointer aliasing (in GCC: -fstrict-aliasing). But also try -O2, since -O3 often tries to do too much. Two final caveats. 1) In the above code, x^2 is really pow(x, 2.0). 2) Avoid pointer tricks. (Better to use array indexing than pointer arithmetic.) Pointer tricks will often confuse the compiler and slow down the code. Randy -- Randy Crawford http://www.ruf.rice.edu/~rand rand AT rice DOT edu "Overstatement sucks." -- William of Ockham .