From automatic speech recognition to discovering unusual stars, underlying almost all automated discovery tasks is the ability to compare and contrast data. Yet despite the prevalence of computing power and abundance of data, understanding exactly how to perform this comparison has resisted automation.
A key challenge is that most data comparison algorithms today rely on a human expert to specify the important distinguishing “features” that characterize a particular data set. Nearly all automated discovery systems today, from automatic image recognition to the discovery of new astronomical objects, rely at their core on the ability to compare data: such systems must be able to compare and contrast data records in order to group them, classify them, or identify the odd-one-out. Despite rapid growth in the amount of data collected and the increasing rate at which it can be processed, analysis of quantitative data streams still relies heavily on knowing what to look for.
Any time a data mining algorithm searches beyond simple correlations, a human expert must help define a notion of similarity, either by specifying important distinguishing features of the data to compare or by training learning algorithms on large numbers of examples. Determining the similarity between two data streams is key to any data mining process, but it relies heavily on human-prescribed criteria.
Research in machine learning is dominated by the search for good “features”, which are typically understood to be heuristically chosen discriminative attributes characterizing objects or phenomena of interest. The ability of experts to manually define appropriate features for data summarization is not keeping pace with the increasing volume, variety and velocity of big data. Moreover, the number of characterizing features, i.e., the size of the feature set, needs to be relatively small to avoid intractability of the subsequent learning algorithms. Such small sets of discriminating attributes are often hard to find. Additionally, their heuristic definition precludes any notion of optimality: it is impossible to quantify the quality of a given feature set in absolute terms, so a feature set can only be compared, in the context of a specific task, against a few selected variations.
A number of deep learning approaches that learn features automatically have recently been demonstrated, but these typically require large amounts of data and computational effort to train. In addition to the heuristic nature of feature selection, machine learning algorithms typically necessitate the choice of a distance metric in the feature space. For example, the classic “nearest neighbor” k-NN classifier requires a definition of proximity, and the k-means algorithm depends on pairwise distances in the feature space for clustering. The choice of the metric crucially impacts both supervised and unsupervised learning algorithms.
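As a purely illustrative sketch of how the metric choice can change a nearest-neighbor decision, the following toy example classifies one query point with a 1-NN rule under two different metrics. The points, labels, and function names are hypothetical and chosen only for illustration; they are not taken from any particular system.

```python
import math


def euclidean(p, q):
    """Straight-line (L2) distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def chebyshev(p, q):
    """Largest per-coordinate difference (L-infinity distance)."""
    return max(abs(a - b) for a, b in zip(p, q))


def nearest_label(query, labeled_points, metric):
    """1-NN rule: return the label of the point closest to the query."""
    closest = min(labeled_points, key=lambda item: metric(query, item[0]))
    return closest[1]


# Hypothetical two-dimensional feature vectors with class labels.
labeled_points = [((3.0, 3.0), "A"), ((0.0, 4.0), "B")]
query = (0.0, 0.0)

print(nearest_label(query, labeled_points, euclidean))  # B
print(nearest_label(query, labeled_points, chebyshev))  # A
```

The same query receives a different label depending solely on the metric, even though the feature vectors themselves are unchanged.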
To side-step the heuristic metric problem, a number of recent approaches attempt to learn appropriate metrics directly from data. Some supervised approaches to metric learning can “back out” a metric from side information or labeled constraints. Unsupervised approaches have exploited a connection to dimensionality reduction and embedding strategies, essentially attempting to uncover the geometric structure of geodesics in the feature space (e.g. manifold learning). However, such inferred geometric structures are, again, strongly dependent on the initial heuristic choice of the feature set. Since Euclidean distances between feature vectors are often misleading, heuristic features make it impossible to conceive of a task-independent universal metric in the feature space. While the advantage of considering the notion of similarity between data instead of between feature vectors has been recognized, the definition of similarity measures has remained intrinsically heuristic and application dependent.
Thus, there is a need for an automated, universal metric to estimate the differences and similarities between arbitrary data streams in order to eliminate the reliance on expert-defined features or training. The invention satisfies this need.