Text classification is the problem of assigning text documents to topics or to pre-defined sets of classes, and it has a spectrum of applications in Natural Language Processing (NLP) that range from document categorization to information retrieval. A tremendous amount of work has been done on text classification, including techniques based on decision trees, neural networks, nearest neighbor methods, Rocchio's method, support vector machines (SVM), linear least squares, Naive Bayes, rule-based methods and more. Some of these methods are unsupervised, i.e., they use no labeled documents (see, e.g., Y. Ko and J. Seo, Automatic Text Categorization by Unsupervised Learning, In Proceedings of COLING-00, the 18th International Conference on Computational Linguistics, incorporated herein by reference), while most of the methods assume a set of document or topic labels (see, e.g., Dasgupta et al., Feature Selection Methods For Text Classification, in Proceedings of the 13th Annual ACM SIGKDD Conference, 2007, pp. 230-239; T. Joachims, Text Categorization With Support Vector Machines: Learning With Many Relevant Features, ECML, 1998; A. McCallum and K. Nigam, A Comparison Of Event Models For Naive Bayes Text Classification, in Proc. of the AAAI-98 Workshop on Learning for Text Classification, AAAI Press, 1998, pp. 41-48; incorporated herein by reference).
McCallum and Nigam showed that even a simple supervised classifier can produce acceptable classification accuracy. They showed that text could be classified by assuming conditional independence between words given the labels and by building a Naive Bayes classifier. A test document can then be classified simply by using Bayes' theorem to compute the probability of each class label given the words of the document. Although such a simple method produced promising results, text classification was further improved by Joachims, who presented an SVM-based classifier and showed that a more sophisticated algorithm can classify documents better than a simple Naive Bayes approach. Since Joachims' work, many more supervised algorithms have been proposed for text classification; these are described in detail in F. Sebastiani, Machine Learning In Automated Text Categorization, CoRR, vol. cs.IR/0110053, 2001, incorporated herein by reference.
In a conventional Naive Bayes (NB) classification approach, given a test document feature vector y, the a posteriori probability of class Ci given y is defined as:
$$p(C_i \mid y) = \frac{p(C_i)\, p(y \mid C_i)}{p(y)}$$
Within the NB framework, the best class is defined as the one which maximizes the posterior probability. In other words,
$$i^{*} = \arg\max_{i} \, p(C_i \mid y)$$
where the terms p(Ci) and p(y|Ci) can be estimated as described below, and where the term p(y) can be assumed to be constant across different classes and so typically is ignored.
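As a short worked step, and assuming that i* denotes the index of the maximizing class, the reason p(y) can be ignored may be written out as follows. Because p(y) is positive and does not depend on i, dividing by it does not change which class attains the maximum:

```latex
\begin{aligned}
i^{*} &= \arg\max_{i} \; \frac{p(C_i)\, p(y \mid C_i)}{p(y)} \\
      &= \arg\max_{i} \; p(C_i)\, p(y \mid C_i)
\end{aligned}
```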
The prior probability of class Ci is p(Ci), which can be computed on the training set by counting the number of occurrences of each class. In other words, if N is the total number of documents in training and Ni is the number of documents from class i, then
$$P(C_i) = \frac{N_i}{N}.$$
The term p(y|Ci) can be computed assuming that document y is comprised of the words y={w1, w2, . . . , wn}, where n is the number of words. A “naive” conditional independence assumption is made on the term p(y|Ci)=p(w1, . . . , wn|Ci) and it is expressed as:
$$P(w_1, \ldots, w_n \mid C_i) = \prod_{j=1}^{n} P(w_j \mid C_i)$$
Each term P(wj|Ci) is computed by counting the number of times word wj appears in the training documents from class Ci. Typically, to avoid zero probabilities if word wj is not found in class Ci, add-one smoothing is used. Thus if we define Nij as the number of times word wj is found in class Ci, we define P(wj|Ci) as follows, where V is the size of the vocabulary:
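$$P(w_j \mid C_i) = \frac{N_{ij} + 1}{\sum_{k=1}^{V} N_{ik} + V}$$
Here the sum in the denominator runs over all V words of the vocabulary, so that it equals the total number of word occurrences in class Ci. The above equations show that, instead of making a classification decision on a test document using information about individual examples in training, the Naive Bayes method pools all information about training data to estimate probability models for P(Ci) and P(wj|Ci).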
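The estimation and decision steps described above can be illustrated with a minimal Python sketch. It is not taken from any of the cited works. The function names (train_naive_bayes, word_prob, classify) and the toy data are hypothetical, and two details are assumptions added for the illustration: scores are accumulated as sums of logarithms rather than products, to avoid numerical underflow, and test words that never occur in training are skipped.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Estimate P(Ci) = Ni / N and collect the word counts Nij per class."""
    class_doc_counts = Counter(labels)   # Ni: number of documents per class
    word_counts = defaultdict(Counter)   # Nij: count of word j in class i
    vocab = set()
    for words, label in zip(docs, labels):
        word_counts[label].update(words)
        vocab.update(words)
    n_docs = len(docs)                   # N: total number of documents
    priors = {c: n / n_docs for c, n in class_doc_counts.items()}
    return priors, word_counts, vocab


def word_prob(word, label, word_counts, vocab):
    """Add-one smoothed P(wj|Ci) = (Nij + 1) / (words in Ci + V)."""
    total = sum(word_counts[label].values())
    return (word_counts[label][word] + 1) / (total + len(vocab))


def classify(words, priors, word_counts, vocab):
    """Return the class maximizing log P(Ci) + sum_j log P(wj|Ci)."""
    best_label, best_score = None, -math.inf
    for label, prior in priors.items():
        score = math.log(prior)
        for w in words:
            if w in vocab:  # assumption: out-of-vocabulary words are skipped
                score += math.log(word_prob(w, label, word_counts, vocab))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    # Hypothetical toy training set of tokenized documents.
    docs = [
        ["cheap", "pills", "cheap"],
        ["meeting", "agenda"],
        ["cheap", "offer"],
        ["project", "meeting"],
    ]
    labels = ["spam", "ham", "spam", "ham"]
    priors, word_counts, vocab = train_naive_bayes(docs, labels)
    print(classify(["cheap", "offer"], priors, word_counts, vocab))
```

Note that the sketch stores only per-class counts after training, which reflects the pooling of training information described above: no individual training document is consulted when a test document is classified.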
Besides the improvement in the types of classifiers, there has been significant work in feature selection for text classification. Some of these feature selection methods are based on information gain, odds ratio, F-measure and Chi-Square testing (see, e.g., Ko and Seo; G. Forman, An Extensive Empirical Study Of Feature Selection Metrics For Text Classification, Journal of Machine Learning Research, vol. 3, pp. 1289-1305, 2003; incorporated herein by reference). Although the type of feature selection algorithm may vary, it is generally agreed that feature selection is crucial and improves the performance of a text classifier.