1. Field of the Invention
The present invention generally relates to supervised learning as applied to text categorization and, more particularly, to a method for categorizing messages or documents containing text.
2. Background Description
The text categorization problem is to determine predefined categories for an incoming unlabeled message or document containing text based on information extracted from a training set of labeled messages or documents. Text categorization is an important practical problem for companies that wish to use computers to categorize incoming e-mail (electronic mail), thereby either enabling an automatic machine response to the e-mail or simply assuring that the e-mail reaches the correct human recipient. But, beyond e-mail, text items to be categorized may come from many sources, including the output of voice recognition software, collections of documents (e.g., news stories, patents, or case summaries), and the contents of web pages.
Any data item containing text is referred to as a document, and the term herein is meant to be taken in this most general sense.
The text categorization problem can be reduced to a set of binary classification problems (one for each category), where for each one wishes to determine a method for predicting in-class versus out-of-class membership. Such supervised learning problems have been widely studied in the past. Recently, many methods developed for classification problems have been applied to text categorization. For example, Chidanand Apte, Fred Damerau, and Sholom M. Weiss in "Automated learning of decision rules for text categorization", ACM Transactions on Information Systems, 12:233-251 (1994), applied an inductive rule learning algorithm, SWAP1, to the text categorization problem. Yiming Yang and Christopher G. Chute in "An example-based mapping method for text categorization and retrieval", ACM Transactions on Information Systems, 12:252-277 (1994), proposed a linear least square fitting algorithm to train linear classifiers. Yiming Yang also compared a number of statistical methods for text categorization in "An evaluation of statistical approaches to text categorization", Information Retrieval Journal, 1:66-99 (1999). The best performances previously reported in the literature are from weighted resampled decision trees (i.e., boosting) in "Maximizing text-mining performance" by S. M. Weiss, C. Apte, F. Damerau, D. E. Johnson, F. L. Oles, T. Goetz, and T. Hampp, IEEE Intelligent Systems, 14:63-69 (1999), and from support vector machines in "Inductive learning algorithms and representations for text categorization" by S. Dumais, J. Platt, D. Heckerman, and M. Sahami, Technical Report, Microsoft Research (1998). However, training the classifier in these approaches is much slower than in the method we will be presenting here.
Common to all these approaches is the use of a numeric vector to represent a document. This can be done in many ways. Because of the vast numbers of different words that may appear in text, generally one gets sparse vectors of very high dimensionality as document representations. Thus, text categorization necessitates using techniques of supervised learning that are well suited to high dimensional data.
Formally, a two-class pattern recognition problem is to determine a label y ∈ {−1, 1} associated with a vector x of input variables. A useful method for solving this problem is by using linear discriminant functions, which consist of linear combinations of the input variables. Various techniques have been proposed for determining the weight values for linear discriminant classifiers from a training set of labeled data (x1, y1), . . . , (xn, yn). Here, and throughout this document, n is the number of items in a training set. Specifically, desired are a weight vector w and a threshold t such that wTx < t if its label y = −1 and wTx ≥ t if its label y = 1. Here, the notation wTx means the product, using matrix multiplication, of the transpose of the column vector w and the column vector x, which is the same as the inner product of w and x, which is the same as the dot product of w and x. Thus, the hyperplane consisting of all x such that wTx = t would approximately separate the in-class vectors from the out-of-class vectors.
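As a concrete illustration of such a linear discriminant classifier, the decision rule wTx ≥ t can be evaluated directly; the weights and threshold below are purely illustrative values, not taken from the invention:

```python
import numpy as np

# Illustrative linear discriminant classifier on 3 input variables.
w = np.array([0.5, -1.0, 2.0])   # weight vector (hypothetical values)
t = 0.25                          # threshold (hypothetical value)

def classify(x, w, t):
    """Return +1 (in-class) if w^T x >= t, else -1 (out-of-class)."""
    return 1 if w @ x >= t else -1

x = np.array([1.0, 0.0, 0.5])
label = classify(x, w, t)   # w @ x = 1.5 >= 0.25, so label = +1
```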
The problem just described may readily be converted into one in which the threshold t is taken to be zero. One does this by embedding all the data into a space with one more dimension, and then translating the original space by some chosen nonzero distance A from its original position. Normally, one takes A=1. Hence, in this conversion, each vector (z1, . . . , zm) is traded in for (z1, . . . , zm, A). For each hyperplane in the original space, there is a unique hyperplane in the larger space that passes through the origin of the larger space. Instead of searching for both an m-dimensional weight vector along with a threshold t, one can therefore search for an (m+1)-dimensional weight vector along with an anticipated threshold of zero.
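The conversion just described can be sketched in a few lines: append the constant A to each data vector and fold the threshold into an extra weight coordinate, so that the augmented classifier uses a zero threshold. The particular values here are illustrative only:

```python
import numpy as np

A = 1.0  # the chosen nonzero distance; normally A = 1

def augment(x, A=1.0):
    """Map (z1, ..., zm) to (z1, ..., zm, A)."""
    return np.append(x, A)

# Original classifier: w^T x >= t.  Equivalent augmented classifier:
# w'^T x' >= 0 with w' = (w, -t/A) and x' = (x, A),
# since w'^T x' = w^T x - t.
w = np.array([0.5, -1.0, 2.0])   # hypothetical weights
t = 0.25                          # hypothetical threshold
w_aug = np.append(w, -t / A)

x = np.array([1.0, 0.0, 0.5])
x_aug = augment(x, A)
# (w @ x >= t) agrees with (w_aug @ x_aug >= 0)
```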
Under the assumption that the vectors of input variables have been suitably transformed so that we may take t = 0, the training error rate for a linear classifier with weight vector w is given by

(1/n) Σ_{i=1}^{n} f(wTxi yi),    (1)

where f is the step function

f(x) = 1 if x ≤ 0, and f(x) = 0 if x > 0.    (2)
A number of approaches to solving categorization problems by finding linear discriminant functions have been advanced over the years. In the early statistical literature, the weight was obtained by using linear discriminant analysis, which makes the assumption that each class has a Gaussian distribution (see, for example, B. D. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press (1996), chapter 3). Similar to linear discriminant analysis, an approach widely used in the neural net community is the least square fitting algorithm. Least square fitting has been applied to text categorization problems as described by Yiming Yang et al., supra. Without any assumption on the distribution, a linear separator can be obtained by using the perceptron algorithm that minimizes training error as described by M. L. Minsky and S. A. Papert in Perceptrons, Expanded Ed., MIT Press, Cambridge, Mass., (1990).
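The perceptron algorithm mentioned above can be sketched as follows. This is a minimal textbook version on toy, linearly separable data (all values illustrative), not the method of the present invention:

```python
import numpy as np

def perceptron(X, y, epochs=100):
    """Classic perceptron: on each misclassified example, add y_i * x_i to w.
    Converges when the data are linearly separable."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:      # misclassified (threshold folded into data)
                w += yi * xi
                mistakes += 1
        if mistakes == 0:               # perfect separation on the training set
            break
    return w

# Toy separable data, augmented with a constant 1 to absorb the threshold.
X = np.array([[2.0, 1.0, 1.0], [1.0, 3.0, 1.0],
              [-1.0, -1.0, 1.0], [-2.0, 0.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = perceptron(X, y)
```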
In an attempt to deal with the problem of overfitting the training data, some newer techniques have a theoretical basis in the analysis of the generalization error of classification methods that aim to minimize the training error. This analysis often involves a concept called VC dimension, which was originally discovered by V. N. Vapnik and A. J. Chervonenkis in "On the uniform convergence of relative frequencies of events to their probabilities", Theory of Probability and Its Applications, 16:264-280 (1971), and, independently, by N. Sauer in "On the density of families of sets", Journal of Combinatorial Theory (Series A), 13:145-147 (1972). However, in general, the VC dimension is proportional to the dimension d of the underlying variable x. Vapnik later realized that by restricting the magnitude of the weights, it is possible to achieve generalization performance which is independent of d (V. N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York (1995)). The idea of restricting the magnitude of the weights has been known in the neural net community, and was analyzed by P. L. Bartlett in "The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network", IEEE Transactions on Information Theory, 44(2):525-536 (1998).
It is also known that the generalization performance of a linear classifier trained to minimize the training error is determined by its capacity, which can be measured by the concept of covering number, originally studied by A. N. Kolmogorov and V. M. Tihomirov, "ε-entropy and ε-capacity of sets in functional spaces", Amer. Math. Soc. Transl., 17(2):277-364 (1961). In learning theory, the VC dimension is used to bound the growth rate of covering numbers as a function of sample size. It can be shown that the average generalization performance of a linear classifier obtained from minimizing training error is Õ(√(d/n)) more than the optimal generalization error when the training set consists of n examples. (The notation Õ here indicates that the hidden factor may have a polynomial dependence on log(n).) Clearly, if d is large as compared to n, then the generalization performance from the perceptron algorithm will be poor. Unfortunately, large dimensionality is typical for many real-world problems such as text classification problems, which can have tens of thousands of features. Vapnik realized that by using regularization techniques originating from the numerical solution of ill-posed systems, as described, for example, by A. N. Tikhonov and V. Y. Arsenin in Solution of Ill-Posed Problems, W. H. Winston, Washington, D.C. (1977), one can avoid the dimensional dependency and thus achieve better generalization performance for certain problems, as described in V. N. Vapnik, Estimation of Dependencies Based on Empirical Data, Springer-Verlag, New York (1982), translated from the Russian by Samuel Kotz, and V. N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York (1995).
In The Nature of Statistical Learning Theory, Vapnik examined another method to train linear classifiers which he called the "optimal separating hyperplane" algorithm. The algorithm restricts the 2-norm of the vector of weights, and it produces a linear classifier that gives a proper separation of the two classes, assuming there exists a perfect separation by a hyperplane. A quadratic programming formulation has been derived accordingly, which he called the support vector machine, or SVM. It has been demonstrated that the VC dimension associated with the SVM formulation depends on sup ∥w∥2 sup ∥x∥2, where sup ∥w∥2 is the maximum of the 2-norms of the weights of the linear classifiers under consideration, and sup ∥x∥2 is the maximum 2-norm of the training data. More recently, Bartlett, supra, has studied the generalization performance of restricting the 1-norm of the weights and the ∞-norm of the data, and he obtained for his approach a generalization performance of Õ(log(d)/√n) more than the optimal generalization error. A similar argument has been applied by R. E. Schapire, Y. Freund, P. Bartlett, and Wee Sun Lee in "Boosting the margin: a new explanation for the effectiveness of voting methods", The Annals of Statistics, 26:1651-1686 (1998) to explain the effectiveness of the boosting algorithm.
It is therefore an object of the present invention to provide a method to automatically categorize messages or documents containing text.
According to the invention, a method of solution fits in the general framework of supervised learning, in which a rule or rules for categorizing data is automatically constructed by a computer on the basis of training data that has been labeled beforehand. More specifically, the method involves the construction of a linear separator: training data is used to construct for each category a weight vector w and a threshold t, and the decision of whether a hitherto unseen document d is in the category will depend on the outcome of the test
wTx ≥ t,
where x is a vector derived from the document d. The method also uses a set L of features selected from the training data in order to construct the numerical vector representation x of a document.
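The construction of the numerical vector x from a document using a feature set L can be sketched as a simple bag-of-words count; the feature list below is purely hypothetical, and counts could equally be replaced by binary indicators or weighted values:

```python
from collections import Counter

# Hypothetical feature set L selected from the training data.
L = ["invoice", "meeting", "refund", "schedule"]

def vectorize(document, features):
    """Return a vector x with x[j] = count of features[j] in the document."""
    counts = Counter(document.lower().split())
    return [counts[f] for f in features]

x = vectorize("Please send the refund invoice before the meeting", L)
# x = [1, 1, 1, 0]
```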
The technique employed for obtaining the weight vector and a preliminary value for the threshold from the training data is new, and so, consequently, is the application of this technique to text. Also not taught by the prior art is the step in which the preliminary value of the threshold in a linear classifier can be altered to improve the performance of the classifier as measured by the text-related measures of precision and recall (or other measures), as opposed to the measure of classification error that is pervasive in supervised learning. If the measure of importance were classification error, then altering the threshold after its initial determination would be somewhat at odds with the derivation of the technique.
In summary, the present invention is a method for categorizing text by modifying the training error for a set of training documents in order to derive a convex optimization problem from the modified training error. Once the problem has been defined, it is regularized and then solved by relaxation.
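The overall scheme of replacing the step-function training error with a convex surrogate, adding a regularization term, and minimizing by coordinate-wise (Gauss-Seidel) relaxation can be illustrated generically as follows. This is only a sketch of the general idea: the convex loss here is the squared margin deviation (w·xi yi − 1)², a common convex upper bound on the step function, and it does not reproduce the invention's specific modified training error or relaxation schedule:

```python
import numpy as np

def relaxation_fit(X, y, lam=0.1, sweeps=100):
    """Minimize (1/n) * sum_i (w.x_i*y_i - 1)^2 + lam*||w||^2 by Gauss-Seidel
    relaxation: repeatedly solve exactly for one weight coordinate at a time."""
    n, d = X.shape
    Z = X * y[:, None]                 # z_i = y_i * x_i, so the margin is w @ z_i
    w = np.zeros(d)
    r = Z @ w - 1.0                    # residuals r_i = w.z_i - 1
    col_sq = (Z ** 2).sum(axis=0) / n  # per-coordinate curvature terms
    for _ in range(sweeps):
        for j in range(d):
            grad_j = (Z[:, j] @ r) / n + lam * w[j]
            delta = -grad_j / (col_sq[j] + lam)  # exact 1-D minimizer step
            w[j] += delta
            r += delta * Z[:, j]       # keep residuals consistent with new w_j
    return w

# Toy separable data (illustrative); after fitting, all margins are positive.
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = relaxation_fit(X, y, lam=0.01, sweeps=200)
```

Each inner step solves a one-dimensional quadratic subproblem in closed form, which is what makes this style of relaxation fast on the sparse, high-dimensional vectors typical of text.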