The following relates to clustering and classification apparatuses and methods, machine learning apparatuses and methods, social media systems and methods, and related arts.
Numerous computer-based data processing apparatuses such as document management and retrieval systems, machine learning systems, and so forth manage and/or retrieve data objects based in part on quantitative comparisons between pairs of objects. For example, in clustering, similar objects are grouped together into clusters, while a typical retrieval system task is to retrieve the object(s) most similar to a query object. In these applications, “similarity” is measured by quantitative comparisons, typically in the form of pairwise distance measures. For objects that can be represented as vectors of scalar features, a commonly used distance measure is the Euclidean distance. For other types of objects, however, a Euclidean distance may not be readily employed, and other distance metrics are known. Even if a Euclidean distance is usable, it may produce poorer results than other types of distance measures. Depending on the application, other measures such as the Pearson correlation, cosine similarity, Mahalanobis distance, Minkowski distance, Hamming distance, or edit distance may be usable, but these tend to depend upon the specific structure of the data, which may not be known or properly assumed.
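By way of illustration only, the following minimal Python sketch shows three of the measures named above: the Euclidean distance and cosine similarity for vectors of scalar features, and the Hamming distance for equal-length sequences. The function names are hypothetical and chosen for this example, and the sketch assumes non-zero vectors for the cosine similarity.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length vectors of scalar features."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (assumes neither is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def hamming_distance(s, t):
    """Number of positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(c1 != c2 for c1, c2 in zip(s, t))
```

Note that the first two functions presuppose a vector representation and the third presupposes aligned sequences of equal length, which illustrates how each measure depends upon the structure of the data.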
Depending upon the task, it may be advantageous to pre-compute the distance measures for some or all possible object pairs. If there are N objects in a set, the pre-computed distances are suitably stored in an N×N matrix sometimes referred to as a “similarity matrix”. A problem arises in terms of the high computational complexity of computing the similarity matrix. While a Euclidean distance is rapidly computed, some other types of distances have computational times that scale superlinearly with the number of objects N in the set, e.g. have computational times of O(N³) in some cases. This makes computing the N×N similarity matrix computationally challenging for large values of N.
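By way of illustration only, the pre-computation of an N×N matrix of pairwise distances may be sketched in Python as follows. The function names and the sample points are hypothetical, and the sketch assumes a symmetric distance measure whose value for an object compared with itself is zero, so that only the N(N-1)/2 pairs above the diagonal need to be evaluated.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pairwise_distance_matrix(objects, distance):
    """Return the N x N matrix of pairwise distances for a list of N objects.

    Assumes `distance` is symmetric and zero for identical objects, so each
    unordered pair is evaluated once and mirrored across the diagonal.
    """
    n = len(objects)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = distance(objects[i], objects[j])
            matrix[i][j] = d
            matrix[j][i] = d
    return matrix


# Hypothetical sample data: three points in the plane.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = pairwise_distance_matrix(points, euclidean_distance)
```

Even with this symmetry saving, the number of distance evaluations grows quadratically with N, so any per-pair cost beyond that of a simple measure such as the Euclidean distance is multiplied accordingly.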
It would be useful to provide apparatuses such as clustering apparatuses, retrieval apparatuses, machine learning apparatuses, and so forth, with a distance metric component that rapidly computes pairwise distance measures that emphasize structure in the data without making unnecessary assumptions regarding that structure.
Disclosed in the following are improved data mining techniques that provide various benefits as disclosed herein.