Text classification, the task of automatically assigning categories to natural language text, has become one of the key methods for organizing online information. Most modern approaches to text classification employ machine learning techniques to automatically learn text classifiers from examples. A large number of text classification problems occurring in practice involve many categories. These problems may be multi-class, assigning exactly one class to each document, or multi-label, assigning a variable number of classes to each document. Typically these problems involve a very large feature space where the features consist of a large vocabulary of words and phrases. The feature representation of a document may be many times the size of the document itself. Unfortunately, processing such a large feature set can exhaust computational resources.
Feature selection is an important component of text classification with machine learning techniques. It is used to reduce the load on computational resources and, where there are many noisy features, to improve performance by eliminating them. Several feature selection methods have been suggested in the literature, particularly for binary classification. In general, feature selection methods are categorized into three types: filter, wrapper and embedded methods. See, for example, I. Guyon and A. Elisseeff, An Introduction to Variable and Feature Selection, Journal of Machine Learning Research, 3:1157-1182, 2003. Filter methods select features as a pre-processing step, independently of the prediction method. Because text classification involves a large number of features and filter methods are computationally very efficient, they have been widely used in text classification. For comparisons of a number of filter methods for text classification, see Y. Yang and J. Pedersen, A Comparative Study on Feature Selection in Text Categorization, in International Conference on Machine Learning, 1997, and G. Forman, An Extensive Empirical Study of Feature Selection Metrics for Text Classification, Journal of Machine Learning Research, 3:1289-1305, 2003. These studies identify information gain, chi-squared and bi-normal separation as the leading filter measures. Wrapper methods use the prediction method as a black box to score subsets of features. They have not been tried in text classification because evaluating a very large number of candidate feature subsets is computationally expensive. Finally, embedded methods perform feature selection as part of the training process of the prediction method.
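As an illustration of a filter method, the following minimal sketch scores each term with the chi-squared statistic for one class against the rest and keeps the highest-scoring terms, without consulting any classifier. The function names chi2_scores and select_top_k, the binary term-presence representation, and the four toy documents are assumptions made for this example only; they are not taken from the cited studies.

```python
from collections import Counter

def chi2_scores(docs, labels, target):
    """Chi-squared score of every term for the binary problem `target` vs. rest.

    docs   : list of token lists (only term presence per document is used)
    labels : list of class labels, one per document
    """
    n = len(docs)
    n_pos = sum(1 for y in labels if y == target)
    n_neg = n - n_pos

    # Document frequency of each term in the target class and in the rest.
    df_pos, df_neg = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        (df_pos if y == target else df_neg).update(set(tokens))

    scores = {}
    for term in set(df_pos) | set(df_neg):
        a = df_pos[term]   # target-class documents containing the term
        b = df_neg[term]   # other documents containing the term
        c = n_pos - a      # target-class documents without the term
        d = n_neg - b      # other documents without the term
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores

def select_top_k(scores, k):
    """Return the k best-scoring terms; ties are broken alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

# Toy corpus, invented for illustration.
docs = [
    ["cheap", "pills", "offer"],
    ["cheap", "offer", "now"],
    ["meeting", "agenda", "now"],
    ["project", "meeting", "notes"],
]
labels = ["spam", "spam", "ham", "ham"]

scores = chi2_scores(docs, labels, target="spam")
selected = select_top_k(scores, k=3)
```

Terms that occur equally often in both classes, such as "now" in the toy corpus, score zero and are dropped first. Because the scores depend only on document counts, the selection is cheap, which is consistent with the efficiency argument made above for filter methods.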
Support Vector Machines (SVMs) are an important class of methods for generating text classifiers from examples. SVMs combine high performance and efficiency with improved robustness. Embedded methods for feature selection with SVMs include linear classifiers that use L1 regularization on the weights, and recursive feature elimination, a backward elimination method that removes the features whose weights are smallest in magnitude. See, for example, D. Madigan, A. Genkin, D. D. Lewis, S. Argamon, D. Fradkin, and L. Ye, Author Identification on the Large Scale, in Classification Society of North America, 2005, and I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, Gene Selection for Cancer Classification Using Support Vector Machines, Machine Learning, 46(1/3):389, 2002. Unfortunately, with these methods feature selection is performed independently for each of the binary classifiers. Because features are removed on a class-by-class basis, the importance of a feature to other classes is not considered when removing features.
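The drawback of class-by-class removal can be seen in a small sketch. It assumes that one-vs-rest linear classifiers have already been trained and applies a single magnitude-based elimination pass to each class independently. Full recursive feature elimination would retrain after each removal, which is omitted here. The function name keep_top_k_per_class, the class names, and the weight values are hypothetical and chosen only for illustration.

```python
def keep_top_k_per_class(weights, k):
    """For each class, keep the k features with the largest |weight|.

    weights : dict mapping class name -> list of per-feature weights
    Returns a dict mapping class name -> sorted list of kept feature indices.
    """
    kept = {}
    for cls, w in weights.items():
        # Rank features by decreasing magnitude; ties go to the lower index.
        ranked = sorted(range(len(w)), key=lambda j: (-abs(w[j]), j))
        kept[cls] = sorted(ranked[:k])
    return kept

# Hypothetical one-vs-rest weight vectors over features 0..4 (toy values).
weights = {
    "sports":   [0.9, 0.1, -0.7, 0.0, 0.2],
    "politics": [0.1, 0.8, 0.0, -0.6, 0.2],
    "tech":     [-0.2, 0.0, 0.1, 0.7, 0.9],
}

kept = keep_top_k_per_class(weights, k=2)

# Features that must still be computed for a document to serve all classes.
union = sorted(set().union(*kept.values()))
```

In this toy case each classifier retains only two features, yet the union across the three classes covers all five, so nothing is saved when a document must be scored against every class. No class's selection takes into account which features the other classes kept.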
What is needed is a system and method for an SVM to learn classification functions and perform simultaneous feature selection to find a small set of features that are good for all the classifiers. Such a system and method should be usable by online applications for multi-class text classification where the text content may change rapidly.