Machine-learning-based classifiers have a variety of important uses, including, for example, classification of Web documents. Traditionally, binary classifiers have been built using manually collected sets of positive and negative examples. Collecting the labeled data is highly time-consuming and requires a large amount of human effort. This is notably true in classification problems, such as web page classification, where the negative class is defined as the universe excluding the positive examples. For the negative class, it is difficult and time-consuming to create a set of examples that represents the real-world distribution. For example, to build a classifier that identifies pages containing reviews of entities such as businesses and products, the negative class will include all web pages excluding review pages. There is no clear and efficient way to describe and sample documents from this class. Generally, the goal is to identify the positive set of examples from the larger universal set.
Some approaches address the problems associated with labeling negative class examples by building classifiers using positive and unlabeled examples. In such approaches, essentially, the unlabeled examples are treated as negative examples, and conventional binary classifiers are then built. Such approaches avoid the effort and difficulty required to label the negative set of examples. However, issues such as the presence of positive examples in the unlabeled set and a high imbalance in the ratio of positive to negative examples pose challenges in building these classifiers, and existing approaches are generally computationally expensive.
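By way of illustration only, the conventional approach described above, in which unlabeled examples are treated as negative examples and an ordinary binary classifier is trained, may be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from any particular existing system: the logistic regression model, the class-balancing weights, the synthetic two-dimensional data, and the names `train_logistic`, `fit_pu_baseline`, and `predict_score` are all hypothetical choices made for this example.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(X, y, weights, epochs=200, lr=0.1):
    """Weighted logistic regression fitted by batch gradient descent."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    total = sum(weights)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi, wi in zip(X, y, weights):
            p = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi)))
            err = wi * (p - yi)
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / total for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / total
    return w, b


def fit_pu_baseline(positives, unlabeled):
    """Treat every unlabeled example as negative, then train as usual.

    Assumption for this sketch: per-class weights offset the imbalance
    between the small positive set and the large unlabeled set.
    """
    X = positives + unlabeled
    y = [1] * len(positives) + [0] * len(unlabeled)
    w_pos = len(X) / (2.0 * len(positives))
    w_neg = len(X) / (2.0 * len(unlabeled))
    weights = [w_pos] * len(positives) + [w_neg] * len(unlabeled)
    return train_logistic(X, y, weights)


def predict_score(model, x):
    """Return the model's estimated probability of the positive class."""
    w, b = model
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


# Synthetic data (hypothetical): positives cluster near (2, 2).
# The unlabeled set is mostly true negatives near (-2, -2), but it
# also hides some positives, as the text above warns.
rng = random.Random(0)
positives = [[rng.gauss(2, 1), rng.gauss(2, 1)] for _ in range(50)]
unlabeled = [[rng.gauss(-2, 1), rng.gauss(-2, 1)] for _ in range(180)]
unlabeled += [[rng.gauss(2, 1), rng.gauss(2, 1)] for _ in range(20)]

model = fit_pu_baseline(positives, unlabeled)
print(predict_score(model, [2.0, 2.0]), predict_score(model, [-2.0, -2.0]))
```

In this sketch the hidden positives in the unlabeled set pull the score of genuinely positive items downward, and the class weights are only a crude remedy for the imbalance, which illustrates the two issues noted above.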
There is a need for efficient machine-learning-based classifiers for binary classification of items, using positive and unlabeled examples.