A frequent problem in signal and data processing is one of classification. For many applications, e.g., astronomy, meteorology, medicine, and image processing, samples of the input signal or input data set need to be separated into two distinct classes. For example, in vision systems where faces are recognized from images, it is often desired to classify a face as either female or male. In clinical trials, classification of statistical data can be used to study diseases. To solve this type of problem, binary classifiers or discriminants are used. When the signal is complex, e.g., a signal where data samples have a high dimensionality and a simple classifier is not directly obvious, it is first necessary to “learn” the discriminant from a training signal or from the data samples themselves.
In machine learning of binary classifiers, two techniques are most commonly used: boosting and kernels. One well-known boosting algorithm is AdaBoost, see Freund et al., “A decision-theoretic generalization of on-line learning and an application to boosting,” Journal of Computer and System Sciences, 55, pp. 119–139, 1995, and Schapire et al., “Boosting the margin: a new explanation for the effectiveness of voting methods,” Proc. 14th Inter. Conf. on Machine Learning, pp. 322–330, 1997. For kernel techniques, see Schölkopf et al., “Nonlinear component analysis as a kernel eigenvalue problem,” Neural Computation, 10, pp. 1299–1319, 1998.
Boosting is used to build strong classifiers from a collection of weak classifiers that usually perform only slightly better than chance. Analysis of AdaBoost and other “voting”-type classification methods has explained the apparent tendency of boosting to maximize the margin in the resulting classifier, thus preventing overfitting and improving the generalization performance. However, maximizing the margin directly is a relatively complex optimization task.
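For illustration only, the boosting idea described above can be sketched as follows. This is a minimal sketch of the standard AdaBoost weighting scheme using one-dimensional threshold classifiers (decision stumps) as the weak classifiers. The function names `stump_predict`, `train_stump`, `adaboost`, and `predict`, and the exhaustive stump search, are assumptions made for this example and are not taken from the cited references.

```python
import math

def stump_predict(x, j, thr, pol):
    """Weak classifier: returns pol if feature j is >= thr, otherwise -pol."""
    return pol if x[j] >= thr else -pol

def train_stump(X, y, w):
    """Exhaustively pick the stump with the lowest weighted training error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({x[j] for x in X}):
            for pol in (1, -1):
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if stump_predict(xi, j, thr, pol) != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, pol)
    return best

def adaboost(X, y, rounds):
    """Return a list of (alpha, feature, threshold, polarity) weak classifiers."""
    n = len(X)
    w = [1.0 / n] * n  # start with uniform sample weights
    ensemble = []
    for _ in range(rounds):
        err, j, thr, pol = train_stump(X, y, w)
        err = min(max(err, 1e-10), 1.0 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1.0 - err) / err)  # vote of this weak classifier
        ensemble.append((alpha, j, thr, pol))
        # Increase the weight of misclassified samples, decrease the others.
        w = [wi * math.exp(-alpha * yi * stump_predict(xi, j, thr, pol))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    """Strong classifier: sign of the weighted vote of all weak classifiers."""
    score = sum(a * stump_predict(x, j, thr, pol) for a, j, thr, pol in ensemble)
    return 1 if score >= 0 else -1

# Toy data: labels follow a + + - - + + pattern that no single stump can fit.
X = [[0], [1], [2], [3], [4], [5]]
y = [1, 1, -1, -1, 1, 1]
strong = adaboost(X, y, rounds=3)
print([predict(strong, x) for x in X])
```

On this toy data, each individual stump misclassifies at least two of the six samples, whereas the weighted vote of three stumps separates the two classes.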
Mercer kernels have been used as an implicit mapping mechanism which, when used for classification tasks, makes linear discriminants in a transformed feature space correspond to complex non-linear decision boundaries in the input or training data space, see Boser et al., “A training algorithm for optimal margin classifiers,” Proc. 5th Annual ACM Workshop on Computational Learning Theory, pp. 144–152, 1992. Kernel methods have also been used to build non-linear feature spaces for principal component analysis (PCA), as well as Fisher's linear discriminant analysis (LDA).
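As a small illustration of the implicit mapping described above, the following sketch uses a homogeneous degree-2 polynomial kernel on two-dimensional inputs. The kernel choice, the explicit feature map `feature_map`, and the weights `w` and bias `b` are assumptions chosen for this example. The sketch checks that the kernel value equals an inner product in the transformed feature space, and that a discriminant which is linear in that space corresponds to a circular decision boundary in the input space.

```python
import math

def dot(a, b):
    """Ordinary inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

def poly_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel k(x, z) = (x . z)^2."""
    return dot(x, z) ** 2

def feature_map(x):
    """Explicit map phi for 2-D inputs such that phi(x) . phi(z) = (x . z)^2."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)

def linear_discriminant(x, w=(1.0, 0.0, 1.0), b=-1.0):
    """Linear in feature space: w . phi(x) + b.

    With the illustrative w and b above this equals x1^2 + x2^2 - 1,
    i.e., the unit circle in the input space.
    """
    return dot(w, feature_map(x)) + b

x, z = (1.0, 2.0), (3.0, -1.0)
# The kernel evaluates the feature-space inner product without forming phi.
print(math.isclose(poly_kernel(x, z), dot(feature_map(x), feature_map(z))))
# Points inside the unit circle give a negative value, points outside a positive one.
print(linear_discriminant((0.5, 0.5)) < 0, linear_discriminant((2.0, 0.0)) > 0)
```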
The most familiar example of kernel mapping is used in non-linear support vector machines (SVMs), see Vapnik, “The nature of statistical learning theory,” Springer, 1995. In SVMs, the classification margin, and thus the bound on generalization, is maximized by a simultaneous optimization with respect to all the training samples. In SVMs, samples can be quite close to the decision boundary, and the support vectors are simply the minimal number of training samples needed to build, i.e., support, the decision boundary. The support vectors are almost certainly not “typical” or high-likelihood members of either class. Also, in the case of SVMs, there is usually no direct way of controlling the number of support vectors that are produced.
Therefore, there is a need for a classification method that simplifies the optimization task. It is also desired to bound the number of discriminants used during classification.