Subj : Re: Locating Nearest Neighbors in space (fast)
To   : comp.programming
From : Randy
Date : Thu Sep 08 2005 02:52 pm

KRK wrote:
> Hi,
> 
> I'm programming in C and have two separate structure arrays holding 
> coordinates for points in space
> 
> typedef struct{
>     double x;
>     double y;
>     double z;
> } Position;
> 
> Position *array1;        /*Approx. 50,000 Entries*/
> Position *array2;        /*Approx 50,000 Entries*/
> 
> My problem is this, I want to take each point in array 1, and find its 
> 'nearest spatial neighbor' in array 2.  This can be done by brute force, but 
> is rather slow, since to do this requires on the order of 10^10 multiplies. 
> Further, I want to do this operation abot 180 times!
> 
> Does anyone know a fast algorithm to do this, preferably one that doesn't 
> require fancy knowledge of data structures etc since I'm not a professional 
> programmer.  Thanks.
> 
> 
> Karl 

There may be a standard answer for this, but since I don't know it, here
goes...

If you use manhattan distance (x2 - x1 + y2 - y1 + z2 - z1) instead of
true distance sqrt(x2 - x1)^2 + (y2 - y1)^2 + (z2 - z1)^2), as part of a
first pass that decides whether each point is close enough to warrant
further consideration, you can avoid the vast majority of multiplies
(ans sqrt calls) except when a point falls within the manhattan distance
and thus deserves further consideration.  As long as your points are not
tightly clustered, an approach that trades off fast math for slow math
will reduce the runtime significantly.

To wit:

// min_dist[] is "minimum real distance" (point w/ closest real dist)
// min_mdist[] is "minimum manhattan distance" (closest manhattan dist)
// min_idx[] is the index of the nearest point
//
Position p1[50000], p2[50000];  // more concise than array1, array2

for (a = 0; a < 50000; a++)
  for (b = 0; b < 50000; b++) {
    mdist = abs(p2.x[b] - p1.x[a])
          + abs(p2.y[b] - p1.y[a])
          + abs(p2.z[b] - p1.z[a]);

    if (mdist < min_mdist[a]) {
      dist = sqrt( (p2.x[b] - p1.x[a])^2
                 + (p2.y[b] - p1.y[a])^2
                 + (p2.z[b] - p1.z[a])^2);

      if (dist < min_dist[a]) {
        min_dist[a]  = dist;
        min_mdist[a] = mdist;
        min_idx[a]   = b;
      }
    }
  }

You can probably speed up the code further by cascading the outer
conditional so that the manhattan distance conditional can short circuit
as soon as possible, e.g.:

  // min_mdist is the maximum manhattan distance that excludes only
  //   points that are too far away
  ...
  dx = abs( p2.x[b] - p1.x[a]);

  if (dx < min_mdist) {
    dy = abs( p2.y[b] - p1.y[a]);

    if (dy + dx < min_mdist[a]) {
      dz = abs( p2.z[b] - p1.z[a]);

      if (dx + dy + dz < min_mdist[a]) {
        dist = sqrt( dx^2 + dy^2 + dz^2);
        ...
        min_mdist = dx + dy + dz;  // if you want cluster_size == 1

You may be able to speedup further if you sort your point data arrays
along the index which is most sparse (X, Y or Z).  Then you can add
another outermost conditional to the above code that will short circuit
the traversal along the sorted index (here, X) as soon as possible,
excluding all remaining points along the sorted index which are too far
away to be considered, e.g.:

  // sort point arrays on X
  ...
    for (b = 0; b < 50000; b++) {
      dx = abs( x2[b] - x1[a]);
      if (dx > min_mdist[a])
        break;  // exit to outer loop, and progress to next point a

      dy = abs( y2[b] - y1[a]);
      if (dy + dx < min_mdist[a])
      ...

Also, unless you need to use doubles, use the smaller float.  You'll get
better cache line reuse with smaller data.

Also, turn on the compiler's heavy optimization (in GCC: -O3,
-funroll-loops) and specify that you've used no pointer aliasing (in
GCC: -fstrict-aliasing).  But also try -O2, since -O3 often tries to do
too much.

Two final caveats.  1) In the above code, x^2 is really pow(x, 2.0).
2) Avoid pointer tricks.  (Better to use array indexing than pointer
arithmetic.)  Pointer tricks will often confuse the compiler and slow
down the code.

    Randy

-- 
Randy Crawford   http://www.ruf.rice.edu/~rand   rand AT rice DOT edu

"Overstatement sucks." -- William of Ockham

.