Semi-supervised machine learning involves the ability of a machine to learn a classification or regression function from a set of both labelled and unlabelled sample data points. This is an important problem because in many domains, such as image, audio, and text documents, unlabelled data is much easier and cheaper to collect than labelled data. However, a large amount of data is not very useful unless we can determine what the data is or what it relates to. Thus, the ability of a machine to classify unlabelled data provides a significant advantage for processing large amounts of data for a useful purpose. For example, machine-based classification of images is used in a myriad of applications, e.g., face recognition, motion detection, and the like.
The basic idea of semi-supervised machine learning is to learn or estimate (often implicitly) an underlying density function over both labelled and unlabelled data points in order to classify the unlabelled data points. Generally, in most practical applications data points include many variables or dimensions, i.e., the data points belong to a high-dimensional space. For example, a digital image may have as many dimensions as there are pixels in the image (e.g., 5 million dimensions). The estimation of density functions in such high-dimensional spaces may require a number of examples that grows exponentially with the dimensionality (“d”) of the space. Therefore, an assumption is generally made with respect to the relationship between data points in a dataset. A common assumption is that the data points in a dataset, due to the relationships between them, form a lower-dimensional structure or manifold embedded in the high-dimensional space.
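The manifold assumption can be illustrated with a toy dataset. The following Python sketch (illustrative only, and not part of any reference implementation discussed herein) samples points that occupy three ambient dimensions but are generated from only two intrinsic coordinates, the classic “Swiss roll”:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample n points on a two-dimensional "Swiss roll" manifold
# embedded in three-dimensional ambient space.
n = 1000
t = 1.5 * np.pi * (1 + 2 * rng.random(n))   # intrinsic coordinate 1
h = 21.0 * rng.random(n)                    # intrinsic coordinate 2

# Ambient (3-D) coordinates: each point has 3 dimensions, but the
# data are fully determined by the 2 intrinsic coordinates (t, h).
X = np.column_stack([t * np.cos(t), h, t * np.sin(t)])

print(X.shape)  # (1000, 3): ambient dimensionality is 3
```

Although the ambient dimensionality here is 3 (and could be made arbitrarily large), the intrinsic dimensionality of the structure remains 2, which is what a manifold-learning method attempts to exploit.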
Generally, there are two different approaches for machine-based classification of unlabelled data: the transductive inference (“TI”) approach and the semi-supervised inductive inference (“SSII”) approach. Under the TI approach, the machine classifies unlabelled data points from a given set of labelled and unlabelled data points; all the data points are provided to the system before learning commences. In contrast, the SSII approach relies on a training set consisting of both labelled and unlabelled examples, and a separate set containing only unlabelled data points to be classified. Under the SSII approach, the training set is first used to construct or learn a function that can then be used to classify the unlabelled data points in the subsequent set.
An important distinction between the TI and SSII approaches is the amount of computational resources required for their implementation. With unlimited resources, an SSII problem could be solved by re-running a TI algorithm from scratch each time a new data point arrives. In practice, however, computational resources are limited, and processing a training set once makes classifying new examples substantially less computationally expensive than re-running a TI algorithm for every new point. In general, SSII algorithms are not more accurate than TI algorithms, because every SSII algorithm can be trivially viewed as a TI algorithm: knowing the unlabelled data points before learning begins cannot make classification more difficult. Therefore, SSII algorithms can generally perform only as well as a “corresponding” TI algorithm. Where accuracy is paramount, TI algorithms are preferred, and if they can be made sufficiently fast, they can replace corresponding SSII algorithms. However, TI algorithms operate over a closed set of data points. Thus, where flexibility to introduce new out-of-sample unlabelled data points is desired, SSII algorithms are preferred because they avoid the computational expense of re-learning the density functions for each new data point.
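The trade-off can be sketched numerically. The following Python example uses a simple iterative label-propagation scheme as the TI algorithm (an illustrative choice, not the LE algorithm discussed below), then reuses the transduced labels with a 1-nearest-neighbour rule to classify a new out-of-sample point in SSII fashion, avoiding a full re-run; the function names and the 1-NN out-of-sample step are our own illustrative choices:

```python
import numpy as np

def propagate_labels(X, y, k=7, iters=200):
    """Transductive inference (TI): iterative label propagation over a
    k-NN graph. y holds class ids for labelled points and -1 for
    unlabelled points; all points must be known before learning starts."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)               # exclude self-distances
    nbrs = np.argsort(D, axis=1)[:, :k]
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), nbrs.ravel()] = 1.0
    W = np.maximum(W, W.T)                    # symmetrise the graph
    P = W / W.sum(axis=1, keepdims=True)      # row-stochastic transition
    classes = np.unique(y[y >= 0])
    F = np.zeros((n, len(classes)))
    F[y >= 0, :] = (y[y >= 0, None] == classes[None, :]).astype(float)
    clamp = F.copy()
    for _ in range(iters):
        F = P @ F                             # diffuse label mass
        F[y >= 0] = clamp[y >= 0]             # keep labelled points fixed
    return classes[F.argmax(axis=1)]

# Two well-separated clusters, one labelled example per cluster.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (30, 2)), rng.normal(2, 0.5, (30, 2))])
y = -np.ones(60, dtype=int)
y[0], y[30] = 0, 1

yhat = propagate_labels(X, y)                 # TI pass over the closed set

# SSII-style reuse: classify a *new* out-of-sample point by 1-NN
# against the transduced labels, without re-running the TI algorithm.
x_new = np.array([2.1, 1.9])
new_label = yhat[np.argmin(np.linalg.norm(X - x_new, axis=1))]
print(new_label)
```

Re-running `propagate_labels` for each arriving point would repeat the full O(n²) graph construction and propagation; the 1-NN lookup against the already-transduced labels is the cheap inductive shortcut.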
There are a number of algorithms for semi-supervised learning on manifolds, and several of them are quite similar: the work of Bengio et al. (2003) places multi-dimensional scaling (Cox & Cox, 1994), spectral clustering (Ng et al., 2002), Laplacian Eigenmaps (Belkin & Niyogi, 2004), Isomap (Tenenbaum et al., 2000), and locally linear embedding (Roweis & Saul, 2000) in a single mathematical framework (all of which are incorporated herein by reference).
One effective approach for semi-supervised machine learning is the Laplacian Eigenmaps (“LE”) algorithm. MATLAB code that implements the LE algorithm is available at http://people.cs.uchicago.edu/˜misha/ManifoldLearning/MATLAB/Laplacian.tar and is incorporated herein by reference. The LE algorithm has been demonstrated on the MNIST hand-written digit dataset (available at http://yann.lecun.com/exdb/mnist/index.html). A sample dataset 100 from the MNIST database is shown in FIG. 1. A first set of labelled points 102 is provided, and a second set of unlabelled points 104 is to be classified. The LE algorithm has been used to perform a digit classification task (as well as several other tasks) using very few labelled examples (as further detailed below) and showed reasonably good accuracy.
However, there are several drawbacks to the LE algorithm. The LE algorithm is very computationally expensive. For example, one resource-intensive computation the LE algorithm requires is the construction of the adjacency graph. Using a direct approach, the distance between all pairs of data points is computed, and for each point, only the closest neighbors are kept. For a large dataset, the O(n²d) time to compute all the distances dwarfs the time required to keep track of the closest neighbors. This step can be implemented to use only linear memory, but O(n²d) time can be prohibitive for very large problems.
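The direct adjacency-graph construction described above can be sketched as follows (a brute-force Python illustration; the function name is ours, and a practical implementation would stream the distance computation to keep memory linear):

```python
import numpy as np

def knn_adjacency(X, k):
    """Brute-force k-nearest-neighbour adjacency graph.

    Computes all pairwise distances -- the O(n^2 d) step described
    above -- then keeps only the k closest neighbours of each point.
    """
    n = X.shape[0]
    # All pairwise squared Euclidean distances: O(n^2 d) time.
    sq = np.sum(X ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(D2, np.inf)          # exclude self-distances
    # Keeping the k closest neighbours is cheap by comparison.
    nbrs = np.argsort(D2, axis=1)[:, :k]
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), nbrs.ravel()] = 1.0
    return np.maximum(W, W.T)             # symmetrise the graph

X = np.random.default_rng(0).normal(size=(100, 5))
W = knn_adjacency(X, k=6)
print(W.shape)  # (100, 100)
```

At n = 60,000 and d in the hundreds (as for MNIST), the D2 matrix alone makes this materialised form impractical, which is exactly the scaling problem described above.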
An even more computationally demanding step is the solution of the eigenvalue problem. The LE algorithm requires the computation of an eigendecomposition of an adjacency graph built over the dataset. Although this graph is extremely sparse, interior eigenvectors are required, making the eigendecomposition extremely expensive. For a large, sparse matrix, eigenproblems can be solved, for example, based on MATLAB code using the implicitly restarted Arnoldi method, an iterative method provided by ARPACK (Lehoucq & Sorensen, 1996; Lehoucq et al., 1998), which are incorporated herein by reference. The largest eigenvalues of a sparse matrix (and their corresponding eigenvectors) can be found rapidly using only sparse matrix-vector multiplications (Golub & Van Loan, 1996, incorporated herein by reference). However, the eigenvectors corresponding to the smallest eigenvalues of the graph Laplacian matrix (“L”) are required; to find these, ARPACK must factorize L in the inner loop of the algorithm. This factorization is substantially less sparse than L itself and can require O(n³) time and O(n²) memory. In practice, a machine with two gigabytes (“GB”) of random access memory (“RAM”) was unable to process the 60,000-point MNIST dataset with the LE algorithm due to lack of memory. This indicates that performing a global eigendecomposition on a very large dataset may well be infeasible in many conventional systems.
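The asymmetry between the largest and smallest eigenvalues can be reproduced at small scale with SciPy's ARPACK wrapper `eigsh` (a sketch on a random, merely illustrative graph; the largest eigenvalues need only sparse matrix-vector products, while the smallest require shift-invert mode and thus a factorization):

```python
import numpy as np
from scipy.sparse import random as sparse_random, diags, csc_matrix
from scipy.sparse.linalg import eigsh

# Build a small random symmetric sparse adjacency matrix W and its
# graph Laplacian L = D - W (D = diagonal degree matrix).
A = sparse_random(300, 300, density=0.02, random_state=0)
W = ((A + A.T) > 0).astype(float)
L = csc_matrix(diags(np.asarray(W.sum(axis=1)).ravel()) - W)

# The *largest* eigenvalues are cheap: ARPACK needs only sparse
# matrix-vector multiplications with L.
top_vals = eigsh(L, k=4, which='LM', return_eigenvectors=False)

# The *smallest* eigenvalues require shift-invert mode: ARPACK must
# factorize (L - sigma*I), which is far less sparse than L itself and
# is the step that costs up to O(n^3) time / O(n^2) memory at scale.
small_vals = eigsh(L, k=4, sigma=-1e-3, which='LM',
                   return_eigenvectors=False)
print(np.round(np.sort(small_vals), 6))
```

Because every graph Laplacian annihilates the constant vector, the smallest eigenvalue returned is (numerically) zero; the factorization hidden inside the second `eigsh` call is what exhausted memory on the 60,000-point MNIST problem described above.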
Another drawback is that the LE algorithm is a TI algorithm without an obvious, computationally effective way to convert it to an SSII algorithm. There is no obvious way to apply the LE approach to new out-of-sample data points without solving the resource-intensive global eigenvalue problem. Thus, the LE algorithm is ineffective for labelling new data points that were not part of the initial dataset.
Accordingly, what is needed is a machine learning system and method for semi-supervised learning on manifolds that (1) is less computationally expensive than existing methods, and (2) can provide new point classification without requiring re-computation over the entire dataset.