Text classification, the task of automatically assigning categories to natural language text, has become one of the key methods for organizing online information. Automated text classification is a particularly challenging task in modern data analysis, from both an empirical and a theoretical perspective. The problem is of central interest in many internet applications, and consequently it has received attention from researchers in such diverse areas as information retrieval, machine learning, and the theory of algorithms. Challenges associated with automated text categorization come from many fronts: an appropriate data structure must be chosen to represent the documents; an appropriate objective function must be chosen for optimization in order to avoid overfitting and obtain good generalization; and algorithmic issues arising from the high formal dimensionality of the data must be addressed.
Feature selection, that is, selecting a subset of the features available for describing the data before applying a learning algorithm, is a common technique for addressing this last challenge. See, for example, A. L. Blum and P. Langley, Selection of Relevant Features and Examples in Machine Learning, Artificial Intelligence, 97:245-271, 1997; G. Forman, An Extensive Empirical Study of Feature-Selection Metrics for Text Classification, Journal of Machine Learning Research, 3:1289-1305, 2003; and I. Guyon and A. Elisseeff, An Introduction to Variable and Feature Selection, Journal of Machine Learning Research, 3:1157-1182, 2003. It has been widely observed that feature selection can be a powerful tool for simplifying or speeding up computations, and that when employed appropriately it can lead to little loss in classification quality. Nevertheless, general theoretical performance guarantees are modest, and it is often difficult to claim more than a vague intuitive understanding of why a particular feature selection algorithm performs well when it does. Indeed, selecting an optimal set of features is in general difficult, both theoretically and empirically, and in practice greedy heuristics are often employed.
Recent work in applied data analysis, for example work on Regularized Least Squares Classification (RLSC), Support Vector Machine (SVM) classification, and the Lasso shrinkage and selection method for linear regression and classification, employs the Singular Value Decomposition, which, upon truncation, results in a small number of dimensions, each of which is a linear combination of up to all of the original features. See, for example, D. Fragoudis, D. Meretakis, and S. Likothanassis, Integrating Feature and Instance Selection for Text Classification, In Proceedings of the 8th Annual ACM SIGKDD Conference, pages 501-506, 2002, and T. Joachims, Text Categorization with Support Vector Machines: Learning with Many Relevant Features, In Proceedings of the 10th European Conference on Machine Learning, pages 137-142, 1998. Although RLSC performs comparably to the popular SVMs for text categorization, RLSC is conceptually and theoretically simpler than SVMs, since RLSC can be solved with vector space operations instead of the convex optimization techniques required by SVMs. In practice, however, RLSC is often slower, in particular for problems where the mapping to the feature space is not the identity. For an overview, see R. Rifkin, Everything Old Is New Again: A Fresh Look at Historical Approaches in Machine Learning, PhD thesis, Massachusetts Institute of Technology, 2002, and R. Rifkin, G. Yeo, and T. Poggio, Regularized Least-Squares Classification, in J. A. K. Suykens, G. Horvath, S. Basu, C. Micchelli, and J. Vandewalle, editors, Advances in Learning Theory: Methods, Models and Applications, NATO Science Series III: Computer and Systems Sciences, pages 131-154. IOS Press, 2003.
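To illustrate the remark that RLSC can be solved with vector space operations, the following is a minimal sketch of linear RLSC, assuming the identity mapping to the feature space and labels in {+1, -1}. The function names `rlsc_fit` and `rlsc_predict`, the regularization parameter `lam`, and the toy document-term matrix are hypothetical and are introduced here only for illustration; this is one common formulation, not necessarily the one used in the cited work.

```python
import numpy as np

def rlsc_fit(X, y, lam=1.0):
    """Linear RLSC (assumed formulation): solve (X^T X + lam * I) w = X^T y."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    # A single linear solve; no iterative convex optimizer is needed.
    return np.linalg.solve(A, X.T @ y)

def rlsc_predict(X, w):
    """Classify each row of X by the sign of its score X @ w."""
    return np.where(X @ w >= 0, 1, -1)

# Hypothetical toy data: 4 documents, 3 term features.
X = np.array([[2.0, 0.0, 1.0],
              [1.0, 0.0, 2.0],
              [0.0, 2.0, 1.0],
              [0.0, 1.0, 2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = rlsc_fit(X, y, lam=0.1)
pred = rlsc_predict(X, w)
```

In this toy example the third feature occurs equally in both classes and receives a weight of approximately zero, which hints at why discarding uninformative features may cost little in classification quality.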
What is needed is a system and method for RLSC to efficiently learn classification functions and perform feature selection to find a small set of features that may preserve the relevant geometric structure in the data. Such a system and method should be usable by online applications for text classification where the text content may change rapidly.