Nearest neighbor searching is an important problem in various applications, including e-commerce product searches, web searches, image retrieval, data mining, pattern recognition, and data compression. The problem can be formally described as follows. Given a set S of data points, the task is to preprocess these data points so that, given any query point q, the data point in S nearest to q (with respect to a certain distance measure) can be reported quickly.
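By way of illustration only, the exact problem stated above can be solved without any preprocessing by a linear scan over S. The following is a minimal sketch, assuming points are tuples of floats and the distance measure is Euclidean; the function name nearest_neighbor and the sample data are hypothetical and are not taken from any cited technique.

```python
import math


def nearest_neighbor(points, q):
    """Return the point in `points` closest to `q` by a linear scan.

    Assumes Euclidean distance; any other distance measure could be
    substituted for math.dist.
    """
    if not points:
        raise ValueError("points must be non-empty")
    # Examine every data point once and keep the one with the smallest distance.
    return min(points, key=lambda p: math.dist(p, q))


# Hypothetical sample data for illustration.
S = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
q = (0.9, 1.2)
result = nearest_neighbor(S, q)
print(result)
```

Such a scan examines every point in S for each query, which is the cost that index structures and approximate methods aim to avoid.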
In many applications, users are satisfied with finding an approximate answer that is “close enough” to the exact answer. The approximate nearest neighbor can be defined as follows. Consider a set S of data points and a query point q. Given ε>0, a point p in S is said to be a (1+ε)-approximate nearest neighbor of q if:
dist(p,q) ≤ (1+ε)·dist(p*,q)
where p* is the true nearest neighbor to q.
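The definition above can be checked directly for a candidate point. The following is a minimal sketch, assuming Euclidean distance and a brute-force computation of dist(p*,q); the function name is_approx_nn and the sample data are hypothetical and serve only to illustrate the inequality.

```python
import math


def is_approx_nn(p, q, points, eps):
    """Return True if p is a (1+eps)-approximate nearest neighbor of q in points.

    Assumes Euclidean distance. The true nearest distance dist(p*, q) is
    found by a linear scan, so this is a checker, not a search method.
    """
    if eps <= 0:
        raise ValueError("eps must be positive")
    # Distance from q to its true nearest neighbor p*.
    best = min(math.dist(x, q) for x in points)
    # The defining inequality: dist(p, q) <= (1 + eps) * dist(p*, q).
    return math.dist(p, q) <= (1 + eps) * best


# Hypothetical sample data: the true nearest neighbor of q is (0.0, 0.0).
S = [(0.0, 0.0), (2.0, 0.0), (5.0, 0.0)]
q = (0.9, 0.0)
print(is_approx_nn((2.0, 0.0), q, S, eps=0.25))
print(is_approx_nn((2.0, 0.0), q, S, eps=0.10))
```

In this example, the point (2.0, 0.0) lies at distance 1.1 from q while the true nearest neighbor lies at distance 0.9, so it is accepted when ε is 0.25 (threshold 1.125) and rejected when ε is 0.10 (threshold 0.99). The true nearest neighbor itself satisfies the inequality for every ε>0.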
A number of techniques have been proposed or suggested for determining an approximate nearest neighbor (ANN), especially for high-dimensional data retrieval, since exact nearest neighbor retrieval can become very inefficient in such settings. For example, Sunil Arya et al., “An Optimal Algorithm for Approximate Nearest Neighbor Searching in Fixed Dimensions,” J. of the ACM, (1994), focuses on improved index structures and better pruning techniques under the assumption that the number of desired ANNs is known in advance.
A need exists for improved methods and apparatus for approximate nearest neighbor searches. A further need exists for methods and apparatus for incremental approximate nearest neighbor searches in large data sets. Yet another need exists for methods and apparatus for incremental approximate nearest neighbor searches that do not require the number of desired ANNs to be known in advance.