The exemplary embodiment relates to graphical representations of data and finds particular application in connection with the identification of nearest neighbors to a given data point, based on Minimax distances and for detection of whether the data point is an outlier.
Identifying the appropriate data representation plays a key role in machine learning tasks, in both supervised and unsupervised learning methods. Different data representations can be generated, depending on the choice of distance measure. Examples of distance measures that are commonly used include the (squared) Euclidean distance, Mahalanobis distance, cosine similarity, Pearson correlation, and Hamming and edit distances. Such distances often make explicit or implicit assumptions about the underlying structure in the data. For example, squared Euclidean distance assumes that the data stays inside a (multidimensional) sphere, while the Pearson correlation is only suitable for temporal and sequential data.
However, in many applications, the structure of the data is very complex, folded, and a priori unknown. Therefore, any fixed assumption about the data can easily fail, which means there is a high potential for model mismatch, under-fitting or over-fitting. Thus, it is often helpful to enrich the basic data representation with a meta representation which is more flexible in identifying the structure. One approach is to use a kernel (see, e.g., Shawe-Taylor, et al., “Kernel Methods for Pattern Analysis,” Cambridge University Press, 2004; and Hofmann, et al., “A review of kernel methods in machine learning,” Technical Report No. 156, Max Planck Institute for Biological Cybernetics, pp. 1-52, 2006). However, both the difficulty of choosing an appropriate kernel and the computational complexity of kernel methods restrict the use of this approach.
One category of distance measures, so-called link-based measures, takes into account all the routes between the objects represented in a graph (see, Fouss, et al., “Random-walk computation of similarities between nodes of a graph with application to collaborative recommendation,” IEEE Trans. on Knowl. and Data Eng., 19(3):355-369, 2007, hereinafter, “Fouss 2007”; Chebotarev, “A class of graph-geodetic distances generalizing the shortest-path and the resistance distances,” Discrete Appl. Math., 159(5):295-302, 2011). The route-specific distance between nodes i and j can be computed by summing the edge weights on the route (Yen, et al., “A family of dissimilarity measures between nodes generalizing both the shortest-path and the commute-time distances,” KDD, pp. 785-793, 2008, hereinafter, “Yen 2008”). Yen's link-based distance is then obtained by summing up the route-specific measures of all routes between the two nodes. Such a distance measure can generally capture arbitrarily-shaped structures better than basic measures, such as the Euclidean and Mahalanobis distances. Link-based measures are often obtained by inverting the Laplacian of the distance matrix, as is done in the context of a regularized Laplacian kernel and a Markov diffusion kernel (Yen 2008; Fouss, et al., “An experimental investigation of kernels on graphs for collaborative recommendation and semisupervised classification,” Neural Networks, 31:53-72, 2012, hereinafter, “Fouss 2012”). However, computing all pairs of link-based distances entails inverting an N×N matrix, which yields a running time of O(N³). This approach is thus not suited to large-scale datasets.
Another distance measure, called the Minimax measure, considers, for each route between two objects, the largest gap (edge weight) on that route, and selects the minimum of these largest gaps over all possible routes. This measure, also known as the “Path-based distance measure,” has been proposed for improving clustering results (Fischer, et al. “Path-based clustering for grouping of smooth curves and texture segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., 25(4):513-518, 2003, hereinafter, “Fischer 2003”). It has also been proposed as an axiom for evaluating clustering methods (Zadeh, et al., “A uniqueness theorem for clustering,” Uncertainty in Artificial Intelligence (UAI), pp. 639-646, 2009). A straightforward approach to computing all-pairs Minimax distances is to use an adapted variant of the Floyd-Warshall algorithm. The running time of this algorithm is O(N³) (Aho, et al., “The Design and Analysis of Computer Algorithms,” Addison-Wesley Longman Publishing Co., Inc., Boston, Mass., USA, 1st edition, 1974; Cormen, et al., “Introduction to Algorithms,” McGraw-Hill Higher Education, 2nd edition, 2001). This distance measure has also been integrated with an adapted variant of K-means, yielding an agglomerative algorithm whose running time is O(N²|E| + N³ log N) (Fischer 2003). The Minimax distance has also been proposed for K-nearest neighbor searching.
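The adapted Floyd-Warshall computation mentioned above can be illustrated with a minimal Python sketch. It is not taken from the cited references; it assumes the graph is given as a dense, symmetric matrix of base dissimilarities, with `math.inf` marking a missing edge, and the names `minimax_all_pairs`, `W` and `D` are hypothetical. The only change from the shortest-path version is that the cost of a route through an intermediate node is the larger of its two legs rather than their sum.

```python
import math


def minimax_all_pairs(weights):
    """Return all-pairs Minimax distances for a dense weight matrix.

    weights[i][j] is the base dissimilarity of edge (i, j);
    math.inf marks a missing edge. Runs in O(N^3) time.
    """
    n = len(weights)
    d = [row[:] for row in weights]  # copy so the input is not modified
    for i in range(n):
        d[i][i] = 0.0
    for k in range(n):
        for i in range(n):
            dik = d[i][k]
            for j in range(n):
                # Largest gap on the route i -> k -> j.
                via_k = max(dik, d[k][j])
                if via_k < d[i][j]:
                    d[i][j] = via_k
    return d


# Hypothetical 4-node example: the direct edge 0-3 has a large gap (10),
# but the route 0-1-2-3 has a largest gap of only 3.
INF = math.inf
W = [
    [0.0, 1.0, INF, 10.0],
    [1.0, 0.0, 2.0, INF],
    [INF, 2.0, 0.0, 3.0],
    [10.0, INF, 3.0, 0.0],
]
D = minimax_all_pairs(W)
```

In this example, the Minimax distance between nodes 0 and 3 is determined by the route through nodes 1 and 2 rather than by the direct edge, which is the behavior that lets the measure follow elongated or non-convex structures.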
The standard method (using, for example, the Euclidean distance), the metric learning approach (Weinberger, et al., “Distance metric learning for large margin nearest neighbor classification,” J. Mach. Learn. Res., 10:207-244, 2009), and the shortest path distance (Dijkstra, “A note on two problems in connexion with graphs,” Numerische Mathematik, 1:269-271, 1959, hereinafter, “Dijkstra 1959”; Tenenbaum, et al., “A global geometric framework for nonlinear dimensionality reduction,” Science, 290(5500):2319-23, 2000) can all give poor results, e.g., on non-convex data, since they ignore the underlying geometry.
Kim, et al., “Neighbor search with global geometry: a Minimax message passing algorithm,” ICML, pp. 401-408, 2007, hereinafter, “Kim 2007,” proposes a message passing algorithm with forward and backward steps, similar to the sum-product algorithm described in Kschischang, et al., “Factor graphs and the sum-product algorithm,” IEEE Trans. Inf. Theor., 47(2):498-519, 2006. The method takes O(N) time, which is in theory equal to that of the standard K-nearest neighbor search, but the algorithm needs many more visits of the training dataset. Moreover, this method entails computing a minimum spanning tree (MST) in advance, which may take O(N²) time. Another algorithm is described in Kim, et al., “Walking on Minimax paths for k-NN search,” Proc. 27th AAAI Conf. on Artificial Intelligence, pp. 518-525, 2013, hereinafter, “Kim 2013.” The greedy algorithm proposed in Kim 2013 computes K-nearest neighbors by space partitioning and using Fibonacci heaps, with a run time of O(log N + K log K). However, this method is limited to Euclidean spaces and assumes the graph is sparse.
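The role of the MST precomputation can be illustrated with a minimal Python sketch. This is not the message passing algorithm of Kim 2007 nor the greedy algorithm of Kim 2013; it relies only on the well-known property that the Minimax distance between two nodes equals the largest edge weight on the path joining them in a minimum spanning tree. The sketch assumes a dense, symmetric weight matrix, and the function names `prim_mst`, `minimax_from` and `minimax_knn` are hypothetical.

```python
import math


def prim_mst(weights):
    """Build an MST with Prim's algorithm on a dense matrix in O(N^2) time.

    Returns an adjacency list: adj[u] is a list of (neighbor, edge weight).
    """
    n = len(weights)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known edge linking each node to the tree
    parent = [-1] * n
    best[0] = 0.0
    adj = [[] for _ in range(n)]
    for _ in range(n):
        # Pick the node outside the tree with the cheapest connecting edge.
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: best[v])
        in_tree[u] = True
        if parent[u] >= 0:
            w = weights[u][parent[u]]
            adj[u].append((parent[u], w))
            adj[parent[u]].append((u, w))
        for v in range(n):
            if not in_tree[v] and weights[u][v] < best[v]:
                best[v] = weights[u][v]
                parent[v] = u
    return adj


def minimax_from(adj, source):
    """Minimax distance from source to every node reachable in the tree."""
    dist = {source: 0.0}
    stack = [source]
    while stack:
        u = stack.pop()
        for v, w in adj[u]:
            if v not in dist:
                # Largest edge seen so far on the unique tree path.
                dist[v] = max(dist[u], w)
                stack.append(v)
    return dist


def minimax_knn(weights, source, k):
    """Return the k nodes closest to source under the Minimax distance."""
    dist = minimax_from(prim_mst(weights), source)
    ranked = sorted((d, v) for v, d in dist.items() if v != source)
    return [v for _, v in ranked[:k]]


# Hypothetical 4-node example with a large direct gap between nodes 0 and 3.
INF = math.inf
W = [
    [0.0, 1.0, INF, 10.0],
    [1.0, 0.0, 2.0, INF],
    [INF, 2.0, 0.0, 3.0],
    [10.0, INF, 3.0, 0.0],
]
distances = minimax_from(prim_mst(W), 0)
neighbors = minimax_knn(W, 0, 2)
```

Once the tree is available, a single traversal yields the Minimax distances from one query node, so the O(N²) tree construction on a dense graph dominates the cost in this sketch.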
The exemplary embodiment provides a system and method for K-nearest neighbor searching based on Minimax distances that is computationally efficient and applicable to a wide range of base distance measures.