The naive Bayes classifier remains a popular method for classification, especially for text classification, where it often performs on par with the Support Vector Machine (SVM) classifier (see Non-Patent Document 1). One advantage of the naive Bayes classifier is that it can be interpreted as a generative model, which can easily be extended to model more complex relations (e.g., see Non-Patent Document 3).
In order to learn a naive Bayes classifier, for each class z, we determine the distribution of words that occur in documents belonging to class z. Let us denote the word distribution for class z as θz, and the probability of a specific word w in class z as θw|z. Often this distribution is modeled by a multinomial distribution. In order to classify a new text, the probability of class z given the new text is calculated, up to a normalization constant, by multiplying the prior probability of class z with the probabilities θw|z for each word w in the new document.
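The training and classification steps described above can be sketched as follows. This is a minimal illustration, not the implementation of any cited document: the function names, the toy documents, and the add-alpha smoothing parameter `alpha` are assumptions introduced here. Logarithms are summed instead of multiplying probabilities directly, which is equivalent and avoids numerical underflow.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels, alpha=1.0):
    """Estimate log class priors and log theta_{w|z} (alpha is an assumed smoothing constant)."""
    vocab = sorted({w for doc in docs for w in doc})
    word_counts = defaultdict(Counter)   # class z -> word counts
    class_counts = Counter(labels)       # class z -> number of documents
    for doc, z in zip(docs, labels):
        word_counts[z].update(doc)
    log_prior = {z: math.log(n / len(docs)) for z, n in class_counts.items()}
    log_theta = {}
    for z in class_counts:
        total = sum(word_counts[z].values()) + alpha * len(vocab)
        log_theta[z] = {w: math.log((word_counts[z][w] + alpha) / total) for w in vocab}
    return log_prior, log_theta

def classify(doc, log_prior, log_theta):
    """Return the most probable class and the normalized posterior p(z | doc)."""
    scores = {}
    for z in log_prior:
        # Sum of logs = log of the product of theta_{w|z}; words outside the vocabulary are skipped.
        scores[z] = log_prior[z] + sum(log_theta[z][w] for w in doc if w in log_theta[z])
    top = max(scores.values())
    norm = sum(math.exp(s - top) for s in scores.values())
    posterior = {z: math.exp(s - top) / norm for z, s in scores.items()}
    return max(posterior, key=posterior.get), posterior

# Hypothetical toy corpus for illustration only.
docs = [["cat", "dog", "the"], ["dog", "bird", "the"],
        ["cpu", "the", "disk"], ["cpu", "code", "the"]]
labels = ["Animals", "Animals", "Computer", "Computer"]
log_prior, log_theta = train_naive_bayes(docs, labels)
label, posterior = classify(["the", "dog", "cat"], log_prior, log_theta)
print(label)  # Animals
```

Note that the word "the" contributes the same factor to both classes in this toy corpus, so it does not influence the decision here; with real data its estimated probabilities differ by chance between classes, which is the source of noise discussed next.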
Note that the naive Bayes classifier estimates the probabilities θw|z using only the training data instances (instances with known class). However, words like “I” or “the”, which occur frequently in many documents independently of the class, often introduce noise, and as a result the estimates of θw|z become unreliable. One approach is to use a stop-word list to filter out such words. However, such a stop-word list is static and depends on the domain of the documents. Another approach is to weight the words by their inverse document frequency, as suggested, for example, in Non-Patent Document 1. However, when such weights are assigned, the interpretation of the naive Bayes classifier as a generative model is lost. As a consequence, the weights and their interaction with the parameters of the naive Bayes classifier cannot be learned jointly. Therefore, the weights are either fixed or must be tuned using part of the training data (for example, by using cross-validation).
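To illustrate the inverse-document-frequency weighting mentioned above, the following sketch uses the common definition idf(w) = log(N / df(w)), where N is the number of documents and df(w) is the number of documents containing w. The exact weighting scheme of Non-Patent Document 1 may differ; the function names and the toy corpus are assumptions made for this example.

```python
import math
from collections import Counter

def idf_weights(docs):
    """idf(w) = log(N / df(w)), with df(w) the number of documents containing w."""
    n_docs = len(docs)
    doc_freq = Counter(w for doc in docs for w in set(doc))
    return {w: math.log(n_docs / df) for w, df in doc_freq.items()}

def weighted_counts(doc, idf):
    """Scale each word count of a document by its IDF weight (unknown words get weight 0)."""
    counts = Counter(doc)
    return {w: c * idf.get(w, 0.0) for w, c in counts.items()}

# Hypothetical toy corpus: "the" occurs in every document.
docs = [["cat", "dog", "the"], ["dog", "bird", "the"],
        ["cpu", "the", "disk"], ["cpu", "code", "the"]]
idf = idf_weights(docs)
print(idf["the"])  # 0.0, since log(4 / 4) = 0
print(weighted_counts(["the", "the", "cat"], idf))
```

A word that appears in every document receives weight zero and is effectively removed, while rarer words keep a positive weight. These weights are computed outside the probabilistic model, which is why they cannot be learned jointly with θw|z.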
Another line of research tries to improve classification accuracy by additionally using instances (e.g., text documents) for which the class is not known. In contrast to training data instances (instances with known class), such additional instances are often available in large quantities. For example, in contrast to the few newspaper articles that are manually annotated with a class (e.g., whether the article is about “Animals” or about “Computer”), there is a vast number of newspaper articles for which no such class information is available (unlabeled instances). Such an approach to learning a classifier is often referred to as “semi-supervised”. Non-Patent Document 2 describes such a semi-supervised approach that can improve the estimation of the probabilities θw|z by using unlabeled instances. By using the Expectation Maximization (EM) algorithm to assign class probabilities to unlabeled instances, the authors are able to estimate θw|z for words w that occur in the unlabeled corpus but do not occur in the training data. However, their approach does not provide a solution to the problem of high-frequency words.
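The semi-supervised idea can be sketched as a simple EM loop around a multinomial naive Bayes model. This is a simplified illustration under stated assumptions, not the exact procedure of Non-Patent Document 2: the function names, the smoothing constant `alpha`, the fixed iteration count, and the toy data are all hypothetical. Labeled documents keep their known class, while unlabeled documents receive fractional class probabilities in each E-step.

```python
import math
from collections import Counter

def m_step(docs, resp, classes, vocab, alpha=1.0):
    """Re-estimate priors and theta_{w|z} from (possibly fractional) class responsibilities."""
    prior, theta = {}, {}
    for z in classes:
        weight = sum(r[z] for r in resp)
        prior[z] = (weight + alpha) / (len(docs) + alpha * len(classes))
        counts = Counter()
        for doc, r in zip(docs, resp):
            for w in doc:
                counts[w] += r[z]        # fractional word count for class z
        total = sum(counts.values()) + alpha * len(vocab)
        theta[z] = {w: (counts[w] + alpha) / total for w in vocab}
    return prior, theta

def e_step(doc, prior, theta):
    """Posterior p(z | doc) under the current parameters."""
    log_scores = {z: math.log(prior[z]) + sum(math.log(theta[z][w]) for w in doc)
                  for z in prior}
    top = max(log_scores.values())
    exp_scores = {z: math.exp(s - top) for z, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {z: s / norm for z, s in exp_scores.items()}

def semi_supervised_nb(labeled, labels, unlabeled, n_iter=10, alpha=1.0):
    classes = sorted(set(labels))
    vocab = sorted({w for doc in labeled + unlabeled for w in doc})
    hard = [{z: 1.0 if z == y else 0.0 for z in classes} for y in labels]
    # Initialize the parameters from the labeled documents only.
    prior, theta = m_step(labeled, hard, classes, vocab, alpha)
    for _ in range(n_iter):
        soft = [e_step(doc, prior, theta) for doc in unlabeled]             # E-step
        prior, theta = m_step(labeled + unlabeled, hard + soft,
                              classes, vocab, alpha)                        # M-step
    return prior, theta

# Hypothetical toy data: "kitten" and "compiler" occur only in unlabeled documents.
labeled = [["cat", "dog"], ["cpu", "disk"]]
labels = ["Animals", "Computer"]
unlabeled = [["cat", "kitten"], ["dog", "kitten"],
             ["cpu", "compiler"], ["disk", "compiler"]]
prior, theta = semi_supervised_nb(labeled, labels, unlabeled)
print(theta["Animals"]["kitten"] > theta["Computer"]["kitten"])  # True
```

In this toy example the word "kitten" never appears in a labeled document, yet its co-occurrence with "cat" and "dog" in unlabeled documents lets the model associate it with the class "Animals". A word such as "the" that appears everywhere would still receive noisy estimates, which this scheme does not address.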