1. Field of Invention
The present invention relates to the field of identifying a maximum margin classifier for classification of data, and is specifically directed to a method of optimizing the identification of such a maximum margin classifier when analyzing a large set of data points.
2. Description of Related Art
Non-negative matrix factorization (NMF) is generally a group of algorithms in multivariate analysis (i.e. analysis involving more than one variable) and linear algebra in which a matrix, X, is factorized into (usually) two matrices, W and H, such that NMF(X)→WH.
Non-negative matrix factorization (NMF) has been shown to be a useful decomposition for multivariate data, and NMF permits additive combinations of non-negative basis components.
Factorization of matrices is generally non-unique, and a number of different methods of doing so have been developed (e.g. principal component analysis and singular value decomposition) by incorporating different constraints. Non-negative matrix factorization differs from these methods in that it enforces an additional constraint of having the factors W and H be non-negative, i.e., all elements in matrices W and H must be equal to or greater than zero.
In non-negative matrix factorization, the number of columns of W and the number of rows of H are usually selected so that their product, WH, will be an approximation of X, since a residual error U may remain. The full decomposition of X, therefore, may more generally be defined as the two non-negative matrices W and H plus a residual error, U, such that: X=WH+U.
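The decomposition X=WH+U can be sketched numerically. The following is a minimal illustration, assuming NumPy, that uses the multiplicative update rules of Lee and Seung (cited below); the function name, matrix sizes, and inner dimension are illustrative choices, not part of any particular prior-art implementation.

```python
import numpy as np

def nmf(X, r, n_iter=200, eps=1e-10, seed=0):
    """Approximate a non-negative m x n matrix X as W @ H, with W (m x r)
    and H (r x n) both non-negative, via Lee-Seung multiplicative updates."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, r))          # non-negative random initialization
    H = rng.random((r, n))
    for _ in range(n_iter):
        # Multiplicative updates preserve non-negativity at every step.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Factor a 6 x 5 non-negative matrix with inner dimension r = 2, so that
# W and H together are smaller than X.
X = np.random.default_rng(1).random((6, 5))
W, H = nmf(X, r=2)
U = X - W @ H   # residual error, so that X = W @ H + U
```

Because r is chosen smaller than the dimensions of X, the product WH is only an approximation of X, and the residual U captures what the factorization cannot represent.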
One of the reasons for factoring a matrix X is that when W and H are smaller than X, it can be easier to store and manipulate W and H, rather than X. Non-negative matrix factorization (NMF) has further been shown recently to be useful for many applications in pattern recognition, multimedia, text mining, and DNA gene expressions.
NMF can be traced back to the 1970s, and is described in “Positive Matrix Factorization: A Non-Negative Factor Model With Optimal Utilization of Error Estimates of Data Values”, Environmetrics, volume 5, pages 111-126, 1994, by P. Paatero and U. Tapper (hereby incorporated in its entirety by reference). NMF is further described in “Learning the Parts of Objects By Non-negative Matrix Factorization”, Nature, volume 401, pages 788-791, 1999 by Lee and Seung, which is hereby incorporated in its entirety by reference, and in “Algorithms for Non-negative Matrix Factorization”, NIPS, pages 556-562, 2000, also by Lee and Seung and also hereby incorporated in its entirety by reference. The work of Lee and Seung, in particular, brought much attention to NMF in machine learning and data mining fields.
Support vector machines are a set of related supervised learning methods used for data classification and regression. A support vector machine constructs a hyperplane in a high-dimensional space, which can be used for data classification, data regression or other tasks.
A hyperplane is a concept in geometry, and it is a generalization of the concept of a plane into higher dimensions. Analogous to a plane, which defines a two-dimensional subspace in a three-dimensional space, a hyperplane defines an m-dimensional subspace within a q-dimensional space, where m<q. A line, for example, is a one-dimensional hyperplane in a higher-dimensional space.
High-dimensional hyperplanes share many mathematical properties in common with regular lines and planes. The main idea in using a hyperplane in data analysis is to construct a divide (i.e. a hyperplane) that separates clusters of data points, or vectors, into different classes. These separated data point clusters can then be used for data classification purposes. Intuitively, a good separation is achieved by the hyperplane that has the largest distance (i.e. functional margin) to the nearest training data points of the different classes, since in general, the larger the functional margin, the lower the generalization error of the classifier.
Classifying data is a common task in machine learning. For example, if each data point in an existing sample of data points can be designated as belonging to one of two classes, a goal may be to decide to which class a newly received data point will belong. In the case of support vector machines, each data point may be viewed as a p-dimensional vector (i.e., a list of p numbers), and the goal is to determine whether such points can be separated with a (p−1)-dimensional hyperplane. This may be termed linear classification. In general, there are many hyperplanes that might classify the data (i.e. may separate the data into classifications, or data clusters), but one hyperplane may offer optimal separation.
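The linear classification described above can be sketched concretely: a hyperplane in p dimensions may be written as the set of points x satisfying w·x+b=0, and a new data point is assigned to one of the two classes according to the sign of w·x+b. The following is a hypothetical example in NumPy; the particular hyperplane and test points are illustrative only.

```python
import numpy as np

def classify(w, b, x):
    """Assign point x to class +1 or -1 depending on which side of the
    hyperplane w . x + b = 0 it falls."""
    return 1 if np.dot(w, x) + b > 0 else -1

# Hypothetical 2-dimensional hyperplane (a line): x1 + x2 - 1 = 0,
# i.e. w = [1, 1], b = -1.
w, b = np.array([1.0, 1.0]), -1.0

print(classify(w, b, np.array([2.0, 2.0])))   # prints 1  (one side)
print(classify(w, b, np.array([0.0, 0.0])))   # prints -1 (other side)
```

In the p-dimensional case described in the text, w is simply a vector of p numbers and the same sign test applies; the (p−1)-dimensional hyperplane is the boundary where w·x+b is exactly zero.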
For example, FIG. 1 shows a 2-dimensional space with eighteen data points (or vectors) separated into two clusters of nine data points, each. A first data cluster of nine data points is shown as darkened data points, and a second data cluster of nine data points is shown as lightened data points. Three candidate hyperplanes 11, 13, and 15 (i.e. three lines in the present 2-dimensional example) separate the eighteen data points into two groups, or classes, of data points, but one of the three candidate hyperplanes offers the best data-point separation.
In the present example, hyperplane 13 separates four darkened data points on its left (side A) from five darkened data points and nine lightened data points on its right (side B). In order to obtain meaningful information, however, it is helpful to divide the data points into data clusters, since the data points in each data cluster are likely to have some similar attributes. In the present case, it is relatively self-apparent that hyperplane 13 does not provide useful information regarding similarities or differences between the data points since it does not accurately differentiate between the two data clusters.
Hyperplane 11 does separate the first data cluster (consisting of nine darkened data points) on its upper side (side C) from the second data cluster (consisting of nine lightened data points) on its lower side (side D), but does not provide an optimal separation between the first and second data clusters.
In order to provide meaningful information, it is preferable that the hyperplane that separates the two data clusters provide a maximum separation between the two data clusters. The objective is to choose the hyperplane in which the functional margin (i.e. the distance from the hyperplane to the nearest data point along a line normal to the hyperplane) on each side is maximized. If such a hyperplane exists, it is known as the maximum-margin hyperplane, and such a linear classifier is known as a maximum margin classifier.
In the present example of FIG. 1, margin line 16 defines the border of the first data cluster of darkened data points with reference to hyperplane 15, and margin line 18 defines the border of the second data cluster of lightened data points with reference to hyperplane 15. The data points (or vectors) along margin lines 16 or 18 are typically called support vectors. The bias from the origin to hyperplane 15 is shown as bias term b. The functional margin w of hyperplane 15 to margin lines 16 and 18 is likewise shown. In the present case, hyperplane 15 would be the maximum margin classifier since it has the largest functional margin among the three candidate hyperplanes 11, 13, 15.
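The comparison among candidate hyperplanes can be sketched as follows: for each candidate (w, b), compute the distance from the hyperplane to its nearest data point, and keep the candidate with the largest such margin. This mirrors the selection of hyperplane 15 over hyperplanes 11 and 13, though the coordinates and candidate hyperplanes below are illustrative stand-ins, not the actual data of FIG. 1.

```python
import numpy as np

def margin(w, b, points):
    """Smallest perpendicular distance from any point to the
    hyperplane w . x + b = 0."""
    return min(abs(np.dot(w, x) + b) / np.linalg.norm(w) for x in points)

# Illustrative 2-D data (not the coordinates of FIG. 1): two clusters.
points = np.array([[0.0, 2.0], [1.0, 3.0],    # first ("darkened") cluster
                   [2.0, 0.0], [3.0, 1.0]])   # second ("lightened") cluster

# Three candidate separating hyperplanes, each given as (w, b).
candidates = {
    "A": (np.array([1.0, -1.0]),  0.0),   # the diagonal x1 = x2
    "B": (np.array([1.0, -1.0]),  1.0),   # a parallel line shifted off-center
    "C": (np.array([1.0,  0.0]), -1.5),   # the vertical line x1 = 1.5
}

margins = {name: margin(w, b, points) for name, (w, b) in candidates.items()}
best = max(margins, key=margins.get)
print(best, round(margins[best], 3))   # prints: A 1.414
```

Candidate A passes exactly midway between the two clusters, so its margin (√2 for these points) exceeds that of the off-center and vertical candidates, making it the maximum margin classifier among the three, just as hyperplane 15 is in FIG. 1.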
It should be noted that the topics of non-negative matrix factorization and identification of a maximum margin classifier are separate and distinct. NMF aims to facilitate the storage and manipulation of data by factorizing a large matrix X into two smaller matrices W and H, although one still needs to combine the individual entries in W and H to recover an approximation to the original entries in X. By contrast, identifying a maximum margin classifier for X would entail analyzing the original, individual data entries in X and identifying a hyperplane that provides a maximum margin between data clusters.