2 views

I am developing a simulation program. There are herds of animals (wildebeests), and in that herd, I need to be able to find one animal that is away from the herd.

On the picture below, green dots are away from the herd. It is these points that I'd like to be able to find quickly.

Of course, there is a simple algorithm to solve that problem. Count the number of dots in the neighborhood of each point, and then if that neighborhood is empty (0 point in it), then we know that this point is away from the herd.

The problem is that this algorithm is not efficient at all. I have one million points, and applying this algorithm on each of the million points is very slow.

Is there something that would be faster? Maybe using trees?

We want to avoid that case. A group of green dots in the left corner would be chosen, even though they should not because it's not a single animal that is away from the herd, it's a group of animals. We are only looking for one single animal away from herd.

by (33.1k points)

For nearest neighbors queries, kd-trees are used. They result in O(n log n) queries (one query is in log(n) times n queries, and building the kd-tree is itself in O(n log n) ) which I can see working pretty fast for a couple millions of points, and there are libraries that are already pretty efficient as well (ANN for instance).

ANN stands for "Approximate nearest neighbors", and can be even faster when exact distances are not needed. You want to detect whether the first nearest neighbor distance is large or small, you can set a pretty high threshold which would make things even faster.

Simply determines the distance distribution to everyone nearest neighbor, and find the outliers. Sorting all these distances to determine the outliers is again in O(n log n).