Machine learning algorithms have been shown to be practical methods for real-world recognition problems. They have also proven efficient in highly dynamic domains, where many values and conditions change. Some machine learning algorithms are suited to classification (or predictive modeling), while others have been developed for clustering (or descriptive modeling). Clustering is used to generate an overview of the relationships among data records. The output of such algorithms may be several clusters, where each cluster contains a set of homogeneous records. As applied to analytical customer relationship management (CRM), for example, clusters may comprise groups of customer records with similar characteristics. Clustering requires no labeled data. Classification, on the other hand, requires a set of known, fixed categories and a pool of labeled records (known as training data) to build a classification model. Classification models can be widely used in analytical CRM systems to categorize customer records into predefined classes.
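The contrast between the two tasks can be illustrated with a minimal sketch. It assumes made-up two-feature customer records, a bare-bones k-means loop for clustering, and a nearest-centroid rule for classification. The feature names, the class labels "occasional" and "frequent", and all function names are hypothetical and chosen only for illustration.

```python
# Toy customer records: (monthly_spend, visits_per_month). Values are made up.
records = [(10.0, 1.0), (12.0, 2.0), (11.0, 1.5),
           (90.0, 9.0), (95.0, 10.0), (88.0, 8.5)]

def sq_dist(a, b):
    """Squared Euclidean distance between two records."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of records."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def nearest(point, centers):
    """Index of the center closest to the given record."""
    return min(range(len(centers)), key=lambda k: sq_dist(point, centers[k]))

def kmeans(points, centers, iterations=10):
    """Clustering: groups records using no labels at all."""
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Keep the previous center if a group ends up empty.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return [nearest(p, centers) for p in points]

def train_centroids(labeled):
    """Classification: builds a model from labeled training records."""
    by_label = {}
    for point, label in labeled:
        by_label.setdefault(label, []).append(point)
    return {label: mean(pts) for label, pts in by_label.items()}

def classify(model, point):
    """Assign the predefined class whose centroid is closest."""
    return min(model, key=lambda label: sq_dist(point, model[label]))

# Clustering needs only the records; the cluster ids carry no meaning by themselves.
cluster_ids = kmeans(records, centers=[records[0], records[3]])

# Classification needs fixed categories and labeled training data.
training_data = [((10.0, 1.0), "occasional"), ((90.0, 9.0), "frequent")]
model = train_centroids(training_data)
predicted = [classify(model, r) for r in records]
```

Here `cluster_ids` only says which records belong together, whereas `predicted` assigns each record to one of the predefined classes supplied by the training data.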
One obstacle to classification is the scarcity of labeled data. In many application domains, large amounts of unlabeled data are available, while labeled data is relatively scarce. Recently, semi-supervised learning has been proposed with the promise of overcoming this issue and boosting the capability of learning algorithms. Semi-supervised learning uses both labeled and unlabeled data and can be applied to improve the performance of classification and clustering algorithms.
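One simple way to combine labeled and unlabeled records is self-training, sketched below under stated assumptions. This is a generic illustration, not a method described in this text. It assumes a nearest-centroid classifier, one-feature records with made-up values, and hypothetical names (`centroids`, `predict`, `self_train`, and the labels "low" and "high").

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two records."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroids(labeled):
    """Nearest-centroid model: one mean record per label."""
    groups = {}
    for point, label in labeled:
        groups.setdefault(label, []).append(point)
    return {
        label: tuple(sum(col) / len(pts) for col in zip(*pts))
        for label, pts in groups.items()
    }

def predict(model, point):
    """Label of the closest centroid."""
    return min(model, key=lambda label: sq_dist(point, model[label]))

def self_train(labeled, unlabeled, rounds=5, batch=2):
    """Repeatedly pseudo-label the unlabeled records closest to a centroid."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        model = centroids(labeled)
        # Most confident first: smallest distance to the nearest centroid.
        pool.sort(key=lambda p: min(sq_dist(p, c) for c in model.values()))
        confident, pool = pool[:batch], pool[batch:]
        labeled += [(p, predict(model, p)) for p in confident]
    return centroids(labeled)

# Two labeled records and nine unlabeled ones (illustrative values only).
labeled = [((0.0,), "low"), ((7.0,), "high")]
unlabeled = [(1.0,), (2.0,), (3.0,), (4.0,), (5.0,),
             (8.0,), (9.0,), (10.0,), (11.0,)]

supervised = centroids(labeled)            # uses labeled records only
semi = self_train(labeled, unlabeled)      # also uses unlabeled records

query = (4.5,)
supervised_label = predict(supervised, query)
semi_label = predict(semi, query)
```

On this toy data the labeled-only model assigns the query record to "high", while the self-trained model assigns it to "low", because the unlabeled records pull the class centroids toward the dense regions of the data. Whether such a shift actually improves accuracy depends on the unlabeled records forming groups that match the true classes, which is an assumption of the sketch rather than a guarantee.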
Unlabeled data can be collected by automated means from various databases, while labeled data may require input from human experts or other limited or expensive categorization resources. Because unlabeled data is readily available, or inexpensive to collect, using it is appealing. However, despite this natural appeal, it is not obvious how records without labels can help to develop a system whose purpose is to predict labels.