The present invention relates generally to machine learning, and more particularly to transductive support vector machines.
Machine learning involves techniques to allow computers to “learn”. More specifically, machine learning involves training a computer system to perform some task, rather than directly programming the system to perform the task. The system observes some data and automatically determines some structure of the data for use at a later time when processing unknown data.
Machine learning techniques generally create a function from training data. The training data consists of pairs of input objects (typically vectors) and desired outputs. The function can output a continuous value (called regression), or can predict a class label of the input object (called classification). The task of the learning machine is to predict the value of the function for any valid input object after having seen only a small number of training examples (i.e., pairs of input and target output).
One particular type of learning machine is a support vector machine (SVM). SVMs are well known in the art, for example as described in V. Vapnik, Statistical Learning Theory, Wiley, New York, 1998; and C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery 2, 121-167, 1998. Although well known, a brief description of SVMs will be given here in order to aid in the following description of the present invention.
Consider the classification shown in FIG. 1, which shows data labeled into two classes represented by circles and squares. The question becomes, what is the best way of dividing the two classes? An SVM creates a maximum-margin hyperplane defined by support vectors as shown in FIG. 2. The support vectors are shown as 202, 204 and 206, and they are those input vectors of the training data which are used as classification boundaries to define the hyperplane 208. The goal in defining a hyperplane in a classification problem is to maximize the margin (m) 210, which is the distance between the support vectors of each different class. In other words, the maximum-margin hyperplane splits the training examples such that the distance from the closest support vectors is maximized. The support vectors are determined by solving a quadratic programming (QP) optimization problem. There exist several well known QP algorithms for use with SVMs, for example as described in R. Fletcher, Practical Methods of Optimization, Wiley, New York, 2001; M. S. Bazaraa, H. D. Sherali and C. M. Shetty, Nonlinear Programming: Theory and Algorithms, Wiley Interscience, New York, 1993; and J. C. Platt, “Fast Training of Support Vector Machines using Sequential Minimal Optimization”, Advances in Kernel Methods, MIT Press, 1999. Only a small subset of the training data vectors (i.e., the support vectors) needs to be considered in order to determine the optimal hyperplane. Thus, the problem of defining the support vectors may also be considered a filtering problem. More particularly, the job of the SVM during the training phase is to filter out the training data vectors which are not support vectors.
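The margin and support-vector notions described above can be illustrated with a minimal Python sketch. It assumes that a separating hyperplane w·x + b = 0 is already given, so it only measures distances and does not solve the QP problem. The function names, the weight vector, the offset and the toy points are all hypothetical and are not taken from the figures.

```python
import math

def signed_distance(w, b, x):
    """Signed geometric distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def margin_and_support_vectors(w, b, points, labels, tol=1e-9):
    """Return (margin, support_vectors) for a hyperplane that separates the data.

    labels are +1 or -1. The margin is taken as twice the distance from the
    hyperplane to its closest point, which equals the class-to-class distance
    when the hyperplane is the maximum-margin one (an assumption of this sketch).
    """
    distances = [y * signed_distance(w, b, x) for x, y in zip(points, labels)]
    closest = min(distances)
    if closest <= 0:
        raise ValueError("hyperplane does not separate the two classes")
    # Points lying at the minimum distance are the ones that fix the hyperplane.
    support = [x for x, d in zip(points, distances) if d - closest <= tol]
    return 2.0 * closest, support

# Hypothetical toy data: class +1 on the right, class -1 on the left.
points = [(2.0, 0.0), (3.0, 1.0), (2.0, 2.0), (0.0, 0.0), (0.0, 1.0), (-1.0, 2.0)]
labels = [+1, +1, +1, -1, -1, -1]

# Assumed hyperplane: the vertical line x1 = 1, i.e. w = (1, 0), b = -1.
margin, support = margin_and_support_vectors((1.0, 0.0), -1.0, points, labels)
print("margin:", margin)
print("support vectors:", support)
```

In this toy data, four of the six points lie at the minimum distance from the hyperplane. The other two could be removed without changing the result, which is the filtering view of training described above.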
As can be seen from FIG. 2, the optimal hyperplane 208 is linear, which assumes that the data to be classified is linearly separable. However, this is not always the case. For example, consider FIG. 3 in which the data is classified into two sets (X and O). As shown on the left side of the figure, in one dimensional space the two classes are not linearly separable. By mapping the one dimensional data into two dimensional space as shown on the right side of the figure, the data becomes linearly separable by line 302. This same idea is shown in FIG. 4, which, on the left side of the figure, shows two dimensional data with the classification boundaries defined by support vectors (shown as disks with outlines around them). Here the class divider 402 is a curve, not a line, and the two dimensional data are not linearly separable. By mapping the two dimensional data into higher dimensional space as shown on the right side of FIG. 4, the data becomes linearly separable by hyperplane 404. The function that calculates dot products between the mapped vectors in the space of higher dimensionality is called a kernel and is generally referred to herein as k. The use of the kernel function to map data from a lower to a higher dimensionality is well known in the art, for example as described in V. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
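The one dimensional example can be sketched in Python as follows. The specific mapping x -> (x, x^2), the toy values and the separating threshold are assumptions chosen for illustration, since the mapping used in FIG. 3 is not stated. The sketch also checks that the kernel k returns the same value as the dot product of the explicitly mapped vectors.

```python
def phi(x):
    """Assumed explicit mapping from one dimension to two: x -> (x, x^2)."""
    return (x, x * x)

def dot(u, v):
    """Ordinary dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def k(x, y):
    """Kernel for phi: equals dot(phi(x), phi(y)) without forming phi."""
    return x * y + (x * y) ** 2

# Hypothetical 1-D data: the "O" points sit between the "X" points,
# so no single threshold on x separates the two classes.
xs = [-2.0, -1.0, 1.0, 2.0]
labels = ["X", "O", "O", "X"]

# After mapping, the horizontal line x2 = 2.5 separates the classes.
for x, label in zip(xs, labels):
    x1, x2 = phi(x)
    side = "X" if x2 > 2.5 else "O"
    print(x, "->", (x1, x2), "true:", label, "side of line:", side)

# The kernel agrees with the dot product in the mapped space.
for a in xs:
    for b in xs:
        assert abs(k(a, b) - dot(phi(a), phi(b))) < 1e-12
```

Because only dot products between mapped vectors are needed, a learning machine can work with k directly and never construct the higher dimensional vectors.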
Transductive support vector machines (TSVMs) are learning machines which improve the generalization accuracy of SVMs by using unlabeled data in addition to labeled data. Consider the data shown in FIG. 5, which shows labeled data, each represented as a circle or a square, and unlabeled data, each represented as ‘X’. The class of the unlabeled data is unknown and could be either of the classes (circle or square). TSVMs, like SVMs, create a large margin hyperplane classifier using the labeled training data, but simultaneously force the hyperplane to be as far away from the unlabeled data as possible. TSVMs can give considerable improvement over SVMs in situations in which the number of labeled points of the training data is small and the number of unlabeled points is large.
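The effect of the unlabeled data can be sketched with one commonly used TSVM objective: a regularizer, a hinge loss max(0, 1 - y f(x)) on each labeled example, and a symmetric hinge loss max(0, 1 - |f(x)|) on each unlabeled example. The passage above does not state a particular loss, so this formulation, the parameter names C and C_star, and the toy data are assumptions made for illustration.

```python
def decision(w, b, x):
    """Linear decision value f(x) = w.x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def hinge(z):
    """Hinge loss: zero once z reaches 1, linear below that."""
    return max(0.0, 1.0 - z)

def tsvm_objective(w, b, labeled, unlabeled, C=1.0, C_star=1.0):
    """Assumed TSVM objective: regularizer + labeled loss + unlabeled loss."""
    regularizer = 0.5 * sum(wi * wi for wi in w)
    labeled_loss = sum(hinge(y * decision(w, b, x)) for x, y in labeled)
    # Unlabeled points are penalized for falling inside the margin on either side.
    unlabeled_loss = sum(hinge(abs(decision(w, b, x))) for x in unlabeled)
    return regularizer + C * labeled_loss + C_star * unlabeled_loss

# Hypothetical data: two labeled points and four unlabeled points near the x1 axis.
labeled = [((2.0, 2.0), +1), ((-2.0, -2.0), -1)]
unlabeled = [(2.0, 0.1), (3.0, -0.2), (-2.0, 0.1), (-3.0, -0.1)]

# Two candidate hyperplanes that classify the labeled points equally well.
vertical = ((1.0, 0.0), 0.0)    # x1 = 0, far from the unlabeled points
horizontal = ((0.0, 1.0), 0.0)  # x2 = 0, cuts through the unlabeled points

for name, (w, b) in [("vertical", vertical), ("horizontal", horizontal)]:
    print(name, tsvm_objective(w, b, labeled, unlabeled))
```

Both candidates have zero loss on the labeled points, so a standard SVM objective cannot distinguish them. The unlabeled term assigns the higher objective to the hyperplane that passes through the unlabeled points.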
However, conventional implementations of TSVMs often suffer from an inability to efficiently deal with a large number of unlabeled examples. The first implementation of TSVM, described in K. Bennett and A. Demiriz, “Semi-Supervised Support Vector Machines”, Advances in Neural Information Processing Systems 12, pages 368-374, MIT Press, Cambridge, Mass., 1998, uses an integer programming method, which is intractable for large problems. A combinatorial approach, known as SVMLight TSVM, is described in T. Joachims, “Transductive Inference for Text Classification Using Support Vector Machines”, International Conference on Machine Learning, ICML, 1999, and is practical for no more than a few thousand examples. A sequential optimization procedure, described in G. Fung and O. Mangasarian, “Semi-Supervised Support Vector Machines for Unlabeled Data Classification”, Optimisation Methods and Software, pages 1-14, Kluwer Academic Publishers, Boston, 2001, could potentially scale well, although their largest experiment used only 1000 examples. However, this sequential optimization procedure was for linear cases only, and used a special SVM with a 1-norm regularizer to retain linearity. A primal method, described in O. Chapelle and A. Zien, “Semi-Supervised Classification by Low Density Separation”, Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, 2005, shows improved generalization performance over previous approaches, but still scales as (L+U)³, where L and U are the numbers of labeled and unlabeled examples, respectively. This method also stores the entire (L+U)×(L+U) kernel matrix in memory.
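The practical weight of cubic scaling and of a full (L+U)×(L+U) kernel matrix can be seen with simple arithmetic, sketched below in Python. The 8 bytes per matrix entry (double precision) and the example sizes are assumptions. The cost ratio considers only the cubic term and ignores constant factors.

```python
def kernel_matrix_bytes(num_labeled, num_unlabeled, bytes_per_entry=8):
    """Memory needed to store the full (L+U) x (L+U) kernel matrix.

    bytes_per_entry=8 assumes double-precision entries.
    """
    n = num_labeled + num_unlabeled
    return n * n * bytes_per_entry

def cubic_cost_ratio(n_small, n_large):
    """How much more work an (L+U)^3 method does when n grows from n_small to n_large."""
    return (n_large / n_small) ** 3

# Hypothetical sizes: 100 labeled examples with a growing pool of unlabeled examples.
for num_unlabeled in (900, 9_900, 99_900):
    total = 100 + num_unlabeled
    gigabytes = kernel_matrix_bytes(100, num_unlabeled) / 1e9
    print(total, "examples ->", gigabytes, "GB for the kernel matrix")

print("work ratio for 10x more examples:", cubic_cost_ratio(10_000, 100_000))
```

Under these assumptions, each tenfold increase in the number of examples multiplies the kernel matrix storage by one hundred and the cubic term of the running time by one thousand.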
Although TSVMs are powerful regression and classification tools, they suffer from the inability to efficiently deal with a large number of unlabeled examples. What is needed is a technique which improves TSVM performance and scales well, even in view of a large amount of unlabeled data.