Text classification is the supervised learning task of assigning natural language text documents to one or more predefined categories or classes according to their contents. Although it has been a classical problem in the field of information retrieval for half a century, it is currently attracting increased attention due to the ever-expanding volume of text documents available in digital format. Text classification is used in numerous fields, including, for example, automatic processing of emails, filtering of junk emails, and cataloguing of Web pages and news articles.
Text classification algorithms that utilize supervised learning typically require sufficient training data for the resulting classification model to generalize well. As the amount of training data for each class decreases, the classification accuracy of traditional text classification algorithms degrades dramatically. In practical applications, labeled documents are often very sparse because manually labeling data is tedious and costly, while unlabeled documents are often abundant. As a result, there is much interest in exploiting unlabeled data in text classification. The general problem of exploiting unlabeled data in supervised learning is known, in different contexts, as the semi-supervised learning problem or the labeled-unlabeled problem.
The problem, in the context of text classification, can be formalized as follows. Each sample text document is represented by a vector x ∈ ℝ^d. We are given two datasets, D_l and D_u. Dataset D_l is a labeled dataset consisting of data samples (x_i, t_i), where 1≦i≦n and t_i is the class label, with 1≦t_i≦c. Dataset D_u is an unlabeled dataset consisting of unlabeled data samples x_i, where n+1≦i≦n+m. The semi-supervised learning task is to construct, from both D_l and D_u, a classifier with small generalization error on unseen data. A number of works on semi-supervised text classification have been reported recently.
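As an illustration only, the formalization above can be written down as a small data container. The sketch below is dependency-free Python; the class name SemiSupervisedData and the field names X_l, t_l and X_u are hypothetical choices made for this example and do not come from any particular library or from the methods discussed here.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass
class SemiSupervisedData:
    """Hypothetical container for the labeled and unlabeled datasets."""
    X_l: List[Vector]   # labeled samples x_1 .. x_n
    t_l: List[int]      # class labels t_1 .. t_n, each in 1 .. c
    X_u: List[Vector]   # unlabeled samples x_{n+1} .. x_{n+m}
    c: int              # number of classes

    def __post_init__(self) -> None:
        # Every labeled sample needs exactly one label.
        if len(self.X_l) != len(self.t_l):
            raise ValueError("need exactly one label per labeled sample")
        # Labels must lie in the range 1 .. c.
        if any(not 1 <= t <= self.c for t in self.t_l):
            raise ValueError("labels must satisfy 1 <= t_i <= c")
        # All document vectors must share the same dimension d.
        dims = {len(x) for x in self.X_l + self.X_u}
        if len(dims) > 1:
            raise ValueError("all samples must share the same dimension d")

    @property
    def n(self) -> int:
        """Number of labeled samples."""
        return len(self.X_l)

    @property
    def m(self) -> int:
        """Number of unlabeled samples."""
        return len(self.X_u)


# Toy instance: n = 4 labeled and m = 3 unlabeled documents, d = 2, c = 2.
data = SemiSupervisedData(
    X_l=[[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]],
    t_l=[1, 1, 2, 2],
    X_u=[[0.7, 0.3], [0.3, 0.7], [0.5, 0.5]],
    c=2,
)
```

In practice the vectors would be high-dimensional document representations (for example, term weights), and m would typically be much larger than n.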
While those methods have been reported to obtain considerable improvement over supervised methods when the training dataset is relatively small, they are substantially limited when the labeled dataset is very small, for instance, when it contains fewer than ten (10) labeled examples in each class. This is not unexpected, since these conventional techniques (e.g., co-training, TSVM and EM) typically utilize a similar iterative approach to train an initial classifier. This iterative approach depends heavily on the distribution present in the labeled data. When the labeled data includes a very small number of samples that are distant from the corresponding class centers (e.g., due to high dimensionality), these techniques will often have a poor starting point. As a result, they will generally accumulate more errors over successive iterations.
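To make the dependence on the starting point concrete, the following is a minimal sketch of a generic self-training loop built on a nearest-centroid classifier. It is not an implementation of co-training, TSVM or EM; it is a simplified stand-in that shares their iterative pattern, and all function names (centroids, predict, self_train) are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def centroids(X: Sequence[Vector], t: Sequence[int]) -> Dict[int, List[float]]:
    """Compute the mean vector of each class."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for x, label in zip(X, t):
        acc = sums.setdefault(label, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
        counts[label] = counts.get(label, 0) + 1
    return {k: [v / counts[k] for v in s] for k, s in sums.items()}


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def predict(cents: Dict[int, List[float]], x: Vector) -> int:
    """Assign x to the class with the nearest centroid."""
    return min(cents, key=lambda k: sq_dist(cents[k], x))


def self_train(
    X_l: Sequence[Vector],
    t_l: Sequence[int],
    X_u: Sequence[Vector],
    max_iter: int = 10,
) -> Tuple[Dict[int, List[float]], List[int]]:
    """Fit on labeled data, pseudo-label the unlabeled data, refit, repeat."""
    # The initial classifier is determined by the labeled samples alone.
    cents = centroids(X_l, t_l)
    pseudo = [predict(cents, x) for x in X_u]
    for _ in range(max_iter):
        # Refit on labeled samples plus the current pseudo-labels.
        cents = centroids(list(X_l) + list(X_u), list(t_l) + pseudo)
        new_pseudo = [predict(cents, x) for x in X_u]
        if new_pseudo == pseudo:  # stop once the pseudo-labels are stable
            break
        pseudo = new_pseudo
    return cents, pseudo


# Toy run: one labeled sample per class, each close to its class center.
cents, pseudo = self_train(
    X_l=[[0.0, 0.0], [10.0, 0.0]],
    t_l=[1, 2],
    X_u=[[1.0, 0.0], [2.0, 1.0], [9.0, 0.0], [8.0, 1.0]],
)
print(pseudo)  # [1, 1, 2, 2]
```

Because the first set of pseudo-labels comes entirely from the classifier fitted on the labeled samples, labeled samples that lie far from their class centers would place the initial centroids poorly, and later refits would build on those mislabeled points.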
In view of the above, semi-supervised learning methods construct classifiers using both labeled and unlabeled training data samples. While unlabeled data samples can help to improve the accuracy of trained models to a certain extent, existing methods still face difficulties when the labeled data is insufficient and biased with respect to the underlying data distribution.