1. Field of the Invention
The present invention relates to a system and method for effectively using a Support Vector Machine (SVM) to perform classification into multiple categories. In particular, the present invention relates to an improved system and method for applying SVM multi-classification techniques to computationally solve real-world problems.
2. Description of the Related Art
Multi-class classification problems pose a variety of issues, and applying SVMs in particular to multi-class classification problems presents many difficulties. The original “hard margin” algorithm is designed to determine a single hyperplane between two classes, known as the “maximum margin hyperplane.” However, this algorithm does not efficiently and reliably define such hyperplanes if the classification problem includes training data with overlapping distributions, making it unsuitable for many real-world problems. The “soft margin” algorithm was later developed to lift this restriction, but this introduced a new problem. The soft margin algorithm contains a “user-definable” parameter. This parameter, known as the “cost factor,” must be set outside of the SVM training algorithm in order to provide the algorithm with a correct tradeoff between memorization and generalization. The concept of a cost factor is not unique to SVM classification problems but, rather, is a more general concept related to pattern recognition machine learning. In the context of SVM classification, determining or calculating the cost factor typically requires more information than would otherwise be necessary to train a maximum margin hyperplane.
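By way of illustration only, the role of the cost factor can be sketched with the commonly used soft margin objective, in which the cost factor weights the total margin violation against the size of the weight vector. The function name, the example data, and the two cost factor values below are assumptions chosen for illustration and do not come from any particular SVM implementation.

```python
def soft_margin_objective(w, b, examples, cost_factor):
    """Commonly used soft margin objective:
    0.5 * ||w||^2 + cost_factor * (sum of hinge losses)."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    slack_total = 0.0
    for x, y in examples:  # y is +1 or -1
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        # Hinge loss: zero when the example is on the correct side
        # with a margin of at least 1, positive otherwise.
        slack_total += max(0.0, 1.0 - y * score)
    return margin_term + cost_factor * slack_total


# Hypothetical data: the third example lies on the wrong side of the
# hyperplane, as happens with overlapping class distributions.
examples = [((2.0, 0.0), +1), ((-2.0, 0.0), -1), ((-0.5, 0.0), +1)]
w, b = (1.0, 0.0), 0.0

low_cost = soft_margin_objective(w, b, examples, cost_factor=0.1)
high_cost = soft_margin_objective(w, b, examples, cost_factor=10.0)
```

With a small cost factor the violation contributes little to the objective (favoring generalization), whereas a large cost factor makes the same violation dominate (favoring memorization of the training data).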
Prior art methods and systems have provided some minor improvements and modifications to the SVM algorithms to extend these algorithms to the multi-class case. However, the multi-class algorithms known to date are more computationally intensive than even the soft margin formulation, discussed above. Therefore, much work remains to be done to make these multi-class algorithms more computationally manageable. Additionally, there has not been much study on the theoretical properties of these multi-class algorithms, which raises some doubts as to their accuracy and reliability. For example, the generalization properties or asymptotic behavior modeled by the multi-class algorithms have not been studied and verified to the same degree as the original hard-margin and soft-margin SVM algorithms.
Common alternatives exist in which a multi-class decision is subdivided into many binary problems. For example, a single (binary) SVM classifier is used for each two-class problem, and the results are then combined to make a final decision. There are many algorithms known to those skilled in the art for performing this combination; two of the most popular are known as the “one vs. rest” and the “all pairs” approaches. The “one vs. rest” approach uses one classifier per category to separate that category from all the other categories. The idea is to generate a set of SVMs that each indicate membership in a single class. The result can be ambiguous when more than one SVM, or none, claims a new item, but a variety of tie-breaking schemes are known. Similarly, the “all pairs” approach uses an SVM for every pair of classes and lets every SVM vote to determine the final class of a new item being classified. There are also various voting schemes known to those of ordinary skill in the art. See, e.g., Allwein1, Bishop1, Dietterich1, Platt3, Zadrozny2.
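The two combination approaches described above can be sketched as follows. This is a minimal sketch that assumes the binary SVM scores have already been computed; the class names and score values are hypothetical. The tie-breaking rule shown (highest score, or first class in list order) is only one simple choice among the many known schemes.

```python
def one_vs_rest(scores_by_class):
    """Pick the class whose 'class vs. rest' classifier gives the
    highest score (a simple tie-breaking scheme)."""
    return max(scores_by_class, key=scores_by_class.get)


def all_pairs(pair_scores, classes):
    """Each pairwise classifier casts one vote: a positive score votes
    for the first class of the pair, otherwise for the second."""
    votes = {c: 0 for c in classes}
    for (first, second), score in pair_scores.items():
        votes[first if score > 0 else second] += 1
    # Ties are resolved in favor of the class listed first.
    return max(classes, key=lambda c: votes[c])


classes = ["invoice", "memo", "resume"]

# One score per class, from that class's "one vs. rest" classifier.
ovr_scores = {"invoice": -0.4, "memo": 1.3, "resume": 0.2}

# One score per pair of classes, from that pair's classifier.
pair_scores = {
    ("invoice", "memo"): -0.8,    # votes for memo
    ("invoice", "resume"): 0.5,   # votes for invoice
    ("memo", "resume"): 1.1,      # votes for memo
}

ovr_choice = one_vs_rest(ovr_scores)
pairs_choice = all_pairs(pair_scores, classes)
```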
The output of an SVM classifier is a “score” that, compared to a true probability, has little value outside of the SVM. A positive score means the SVM assigns the new example to one class, and a negative score indicates assignment to the other class. This motivates the names “negative class” and “positive class” used to discuss the two classes being considered by a binary classifier. While the sign of the score determines which class the SVM would assign the new example to, the magnitude of the score is less informative than a probability. The score becomes more positive if an example “belongs more” to the positive class than other examples do, and more negative if the example “belongs more” to the negative class. Thus a large negative score signifies high confidence that the example belongs in the negative class, and a large positive score signifies high confidence that the example belongs in the positive class.
An uncalibrated score of this kind is unacceptable for broader application of SVMs, however, because it is commonly known that having a classifier output probabilities of class membership is far more useful and effective. See, for example, Bishop1, listed in the table of Appendix 1. There are ways to convert SVM scores into probabilities, and these methods are known to those skilled in the art, as described in Platt1, Zadrozny1, Zadrozny2 and Sollich1.
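As one illustration of such a conversion, a widely used technique passes the score through a sigmoid whose two parameters are fitted on held-out data. The sketch below assumes the parameters have already been fitted; the default values of a and b are hypothetical placeholders and are not taken from any of the cited references.

```python
import math


def score_to_probability(score, a=-2.0, b=0.0):
    """Sigmoid mapping P(positive | score) = 1 / (1 + exp(a*score + b)).

    a and b are hypothetical placeholder values; in practice they would
    be fitted on held-out examples. A negative a makes larger scores
    map to larger probabilities.
    """
    z = a * score + b
    # Evaluate in a numerically stable way for large |z|.
    if z >= 0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))


p_boundary = score_to_probability(0.0)    # on the hyperplane
p_positive = score_to_probability(1.5)    # well inside the positive class
p_negative = score_to_probability(-1.5)   # well inside the negative class
```

Under these assumed parameters, a score of zero maps to a probability of 0.5, and scores of equal magnitude but opposite sign map to complementary probabilities.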
The final problem is that the relationships between the features used and class memberships are sometimes not linear. This motivates the kernel component of the SVM, which allows features to be mapped into nonlinear spaces that provide much richer representations. This raises the issue of how to measure the appropriateness of the current representation and how to know whether the current set of features is a good one. If it is not, something else should be tried or, at a minimum, the system should report a diagnostic indicating a lack of confidence in its suitability to the problem. Without this kind of feedback, classification algorithms have a hard time leaving the laboratory and being considered for industrial use.
Support Vector Machines have repeatedly been shown to be more accurate than other classifiers, especially for sparse problems (e.g., a small number of training documents or examples) with many features. See, e.g., Joachims1, Platt1, Sollich1, Dumais1, Hearst1. Additionally, there have been many advances in speeding up their training, which have drastically reduced the computational requirements of training. See, e.g., Platt2, Keerthi1 and Joachims2.
Multi-class classification using the “one vs. all” and “one vs. rest” approaches is already well known to those skilled in the art. See, e.g., Bishop1. Error correcting output codes (ECOC) have been shown to provide more accurate classification results when using linear classifiers. See, e.g., Dietterich1. More recently, this ECOC paradigm has been extended to include other code matrix representations and has yielded more accurate and unified approaches to multi-class classification using binary classifiers. See, e.g., Allwein1, Zadrozny2.
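The code matrix idea can be illustrated with a minimal sketch of Hamming-distance decoding. Each row of the matrix is the codeword of one class, each column corresponds to one binary classifier, and a new example is assigned to the class whose codeword disagrees with the fewest classifier outputs. The particular matrix, the class labels, and the scores below are hypothetical and chosen only for illustration.

```python
# Hypothetical code matrix: 4 classes, 7 binary classifiers.
# Any two codewords differ in 4 positions, so a single wrong
# binary classifier can be corrected.
CODE_MATRIX = {
    "A": (+1, +1, +1, +1, +1, +1, +1),
    "B": (-1, -1, -1, -1, +1, +1, +1),
    "C": (-1, -1, +1, +1, -1, -1, +1),
    "D": (-1, +1, -1, +1, -1, +1, -1),
}


def ecoc_decode(binary_scores, code_matrix=CODE_MATRIX):
    """Return the class whose codeword is closest, in Hamming distance,
    to the signs of the binary classifier scores."""
    signs = [1 if s > 0 else -1 for s in binary_scores]

    def hamming(codeword):
        return sum(1 for bit, sign in zip(codeword, signs) if bit != sign)

    return min(code_matrix, key=lambda label: hamming(code_matrix[label]))


# Scores for an example of class "C" in which the second binary
# classifier is wrong (positive where the codeword has -1).
scores = (-0.9, 0.2, 1.1, 0.7, -0.3, -1.2, 0.5)
predicted = ecoc_decode(scores)
```

In this sketch the single erroneous classifier output is corrected because the received sign pattern is still closer to the codeword of class "C" than to any other codeword.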
It is commonly known to those skilled in the art that calibrating the output of a classification function is useful. See, e.g., RL1, Bishop1. How the outputs of a classifier are calibrated has been shown to be implementation dependent, and an effective calibration has also been shown to depend on the classification algorithm. See, e.g., Platt1, Sollich1, PriceKnerr1, Bishop1, Zadrozny1.
In fact, recent work has even focused on combining the multi-class classification code matrix representation with the calibration of probability outputs. See, e.g., Zadrozny2.
Measuring a kernel's suitability to a problem's representation has also been the focus of much recent research. Most of this research, however, involves designing kernels to suit a problem better rather than measuring whether or not a kernel's application is appropriate. Measuring a kernel's effectiveness on a problem can be handled relatively well by using a holdout or validation set if enough training examples are available. See, e.g., Bishop1. If a chosen kernel yields a poor performance measure, then the kernel is known to be ineffective for this problem. However, determining the source of the ineffectiveness is still difficult, and it remains unknown whether more training examples would solve the problem.
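The holdout measurement described above can be sketched as follows. To keep the sketch self-contained, a very simple kernel classifier (mean kernel similarity to the positive examples minus mean similarity to the negative examples) is used as a stand-in for a trained SVM; that stand-in, the ring-shaped data, and the kernel parameter are assumptions made for illustration only.

```python
import math


def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, gamma=0.5):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def mean_difference_score(kernel, train, x):
    """Stand-in kernel classifier (NOT an SVM): mean similarity to the
    positive training examples minus mean similarity to the negative ones."""
    pos = [kernel(x, xi) for xi, yi in train if yi > 0]
    neg = [kernel(x, xi) for xi, yi in train if yi < 0]
    return sum(pos) / len(pos) - sum(neg) / len(neg)


def holdout_accuracy(kernel, train, holdout):
    """Fraction of holdout examples classified correctly."""
    correct = 0
    for x, y in holdout:
        predicted = 1 if mean_difference_score(kernel, train, x) > 0 else -1
        correct += predicted == y
    return correct / len(holdout)


def ring(radius, label, count=12):
    """Evenly spaced labeled points on a circle around the origin."""
    return [((radius * math.cos(2 * math.pi * i / count),
              radius * math.sin(2 * math.pi * i / count)), label)
            for i in range(count)]


# Hypothetical data: positive class on an inner ring, negative class on
# an outer ring, so the classes are not linearly separable.
data = ring(1.0, +1) + ring(3.0, -1)
train, holdout = data[0::2], data[1::2]  # alternate points held out

linear_accuracy = holdout_accuracy(linear_kernel, train, holdout)
rbf_accuracy = holdout_accuracy(rbf_kernel, train, holdout)
```

In this sketch, a low holdout accuracy for one kernel relative to another indicates only that the kernel is ineffective on this data; as noted above, it does not by itself reveal the source of the ineffectiveness.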
The term “hyperplane” is used herein in accordance with its ordinary technical meaning and, as known to those of ordinary skill in the art, refers to a linear equation of possibly many dimensions. A hyperplane in two dimensions is also often referred to as a line and a hyperplane in three dimensions is often referred to as a plane. When more than three dimensions are involved, the hyperplane is typically only referred to as a hyperplane.
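For illustration, a hyperplane can be represented by a weight vector w and an offset b, with the hyperplane consisting of all points x for which w·x + b equals zero. The sketch below evaluates that expression for a line in two dimensions; the function name and the example points are hypothetical.

```python
def hyperplane_value(w, b, x):
    """Evaluate w.x + b. The result is zero for points on the hyperplane
    and has opposite signs on the hyperplane's two sides."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


# In two dimensions the hyperplane x1 + x2 - 1 = 0 is a line.
w, b = (1.0, 1.0), -1.0

on_line = hyperplane_value(w, b, (0.5, 0.5))     # on the line
one_side = hyperplane_value(w, b, (2.0, 2.0))    # positive side
other_side = hyperplane_value(w, b, (0.0, 0.0))  # negative side
```

The same function applies unchanged to vectors of any dimension, which is the sense in which a hyperplane generalizes a line and a plane.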
The term “optimization,” as used herein, refers to the practice, known to those of ordinary skill, of finding parameters for a function that yield a desired value, range of values, or an extreme value (e.g., a minimum or maximum) for the function's output value.
As used herein, a “kernel” refers to a Mercer Kernel, which is a function between two vectors that defines a distance in accordance with the very general Mercer conditions. This class of functions is well known to those skilled in the art. See, e.g., Vapnik1. One particular kind of kernel is a “sigmoid function,” known to those of ordinary skill to “squash” its domain to a continuous interval. The term “sigmoid” means “S-shaped.” A detailed description of two kinds of sigmoid functions is given in Bishop.
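The “squashing” behavior can be illustrated with two commonly used sigmoid functions, the logistic function and the hyperbolic tangent, together with a Gaussian function of two vectors as one well-known example of a kernel. These particular functions and the parameter gamma are chosen here as assumptions for illustration; the text above does not specify which functions are intended.

```python
import math


def logistic(x):
    """Logistic sigmoid: squashes any real number into the interval (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)  # stable form for negative inputs
    return e / (1.0 + e)


def tanh_sigmoid(x):
    """Hyperbolic tangent: squashes any real number into (-1, 1)."""
    return math.tanh(x)


def gaussian_kernel(x, y, gamma=1.0):
    """Gaussian function of two vectors: equals 1 when the vectors
    coincide and decays toward 0 as they move apart."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


midpoint = logistic(0.0)
squashed = [logistic(v) for v in (-5.0, -1.0, 1.0, 5.0)]
same_point = gaussian_kernel((1.0, 2.0), (1.0, 2.0))
```

The two sigmoid functions are related by tanh(x) = 2·logistic(2x) − 1, so they differ only in the interval to which the input is squashed.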
As used herein the term “transformation of feature vectors” is used in accordance with its ordinary meaning as understood by those of skill in the art and refers to changing a feature vector in accordance with a desired mathematical function.