There are many occasions in data processing in which decisions need to be made: whether an image contains a face or not, whether an email is spam or not spam, whether a target is an enemy aircraft or not, etc. Many conventional techniques for classifying data, such as images, have been developed over the years. All such techniques seek to provide what is known as a classifier. A classifier is a decision rule or function that is learned from training input examples and renders decisions on input data. As used herein, a decision rule is a function that maps input data or examples (e.g., an example of an image) to a label (e.g., a face).
One of the most heavily employed conventional methods for learning classification rules is known as supervised learning. In supervised learning, a classification rule is learned from labeled input examples (say, an image with a face or a not-face label): a classifier is trained on the labeled examples to output a decision rule. Once trained, the classifier/decision rule may be employed to render decisions on new input examples.
Formally, in classical supervised learning, one is given training examples

$$L = \{(x_1, y_1), \ldots, (x_l, y_l)\}, \quad x \in X, \quad y \in \{-1, 1\},$$

drawn from a fixed but unknown probability distribution $P(x, y)$. The goal is to find, in the given set of functions $f(x, \alpha)$, $\alpha \in \Lambda$, the function $f(x, \alpha_0)$ that minimizes the expectation of error:

$$\alpha_0 = \arg\min_{\alpha \in \Lambda} R(\alpha), \qquad R(\alpha) = \int (y - f(x, \alpha))^2 \, dP(x, y).$$
In this basic setting of supervised learning, the labels y are known for all the training examples x, and there is no cost associated with obtaining the labels.
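Because the distribution P(x, y) is unknown, the expectation of error cannot be computed directly; a common stand-in is the average squared error over the training examples. The following is a minimal sketch of that idea, not a definitive implementation. The threshold family `threshold_rule`, the candidate values of alpha, and the four training points are all hypothetical choices made only for illustration.

```python
from typing import Sequence, Tuple

Example = Tuple[float, int]  # (x, y) with y in {-1, 1}


def threshold_rule(x: float, alpha: float) -> int:
    """Hypothetical family f(x, alpha): predict 1 if x >= alpha, else -1."""
    return 1 if x >= alpha else -1


def empirical_risk(examples: Sequence[Example], alpha: float) -> float:
    """Sample average of (y - f(x, alpha))^2, used in place of R(alpha)."""
    total = sum((y - threshold_rule(x, alpha)) ** 2 for x, y in examples)
    return total / len(examples)


def minimize_risk(examples: Sequence[Example], candidates: Sequence[float]) -> float:
    """Return the candidate alpha with the lowest empirical risk."""
    return min(candidates, key=lambda alpha: empirical_risk(examples, alpha))


# Illustrative (made-up) labeled training set and parameter grid.
training = [(0.1, -1), (0.4, -1), (0.6, 1), (0.9, 1)]
alphas = [0.0, 0.25, 0.5, 0.75, 1.0]
best_alpha = minimize_risk(training, alphas)
```

With labels in {−1, 1}, each misclassified example contributes 4 to the sum, so the empirical risk here is simply four times the training error rate.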
The active learning paradigm is a modification of supervised learning. In active learning, there is a cost associated with labels of the training examples, and the goal is to learn a decision rule that is as accurate as possible while minimizing the cost of labels. In other words, during training, the classifier should use as few labels as possible. This setting is useful when unlabeled examples are available relatively easily but labeling the examples is expensive or requires manual effort. For instance, many unlabeled emails might be available, but a user would like to label only a limited number of emails as spam or not spam. The classifier, during training, should present as few emails as possible to obtain spam or not spam labels from a user.
Formally, in active learning, one is given training data as a set of labeled examples

$$L = \{(x_1, y_1), \ldots, (x_l, y_l)\}, \quad x \in X, \quad y \in \{-1, 1\},$$

another set of unlabeled examples

$$U = \{x_{l+1}, \ldots, x_{l+T}\}, \quad x \in X,$$

and access to an oracle that can provide labels for examples in the set U. The task is to find a decision rule (one that provides better results than the rule obtained by using supervised learning on the set L alone) by making as few queries to the oracle as possible. More particularly, an active learning classification method learns a decision rule from labeled examples and a subset of unlabeled examples, for which the algorithm queries the labels from an oracle. A goal of active learning is to efficiently query labels for examples in the set of unlabeled examples. Efficiency may be measured as the number of queries made by the learning algorithm to achieve a particular performance: the number of queries to the oracle needs to be minimized. Further, efficiency may also be measured as the amount of computation required to select examples from the set of unlabeled examples.
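The setting above can be sketched as a pool-based loop: train on L, pick one example from U, ask the oracle for its label, move it into L, and repeat until a query budget is spent. The sketch below is only an illustration under stated assumptions: the one-dimensional threshold learner `fit_threshold`, the rule of querying the pool point closest to the current threshold, the simulated `oracle`, and all data values are hypothetical and are not taken from the text.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[float, int]  # (x, y) with y in {-1, 1}


def fit_threshold(labeled: Sequence[Example]) -> float:
    """Hypothetical learner: midpoint between the largest negative and
    smallest positive x. Assumes at least one example of each class."""
    negatives = [x for x, y in labeled if y == -1]
    positives = [x for x, y in labeled if y == 1]
    return (max(negatives) + min(positives)) / 2.0


def active_learn(
    labeled: Sequence[Example],
    unlabeled: Sequence[float],
    oracle: Callable[[float], int],
    budget: int,
) -> Tuple[float, int]:
    """Run the query loop; return the learned threshold and queries used."""
    labeled_set: List[Example] = list(labeled)
    pool: List[float] = list(unlabeled)
    queries = 0
    while pool and queries < budget:
        alpha = fit_threshold(labeled_set)
        # Assumed selection rule: query the pool point nearest the threshold.
        x = min(pool, key=lambda u: abs(u - alpha))
        pool.remove(x)
        labeled_set.append((x, oracle(x)))  # one oracle query = one unit of cost
        queries += 1
    return fit_threshold(labeled_set), queries


def oracle(x: float) -> int:
    """Simulated oracle with a hidden true threshold of 0.6 (illustrative)."""
    return 1 if x >= 0.6 else -1


L = [(0.0, -1), (1.0, 1)]
U = [0.1, 0.3, 0.45, 0.58, 0.7, 0.9]
learned_alpha, queries_used = active_learn(L, U, oracle, budget=3)
```

In this toy run, three queries are enough for the learned threshold to separate every point in U correctly, whereas labeling the whole pool would have cost six queries. The selection step scans the entire pool on each iteration, which reflects the second efficiency measure mentioned above: the computation needed to choose what to query.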
In the active supervised learning paradigm, the method selects examples from a set of unlabeled examples and queries labels for the selected examples. More particularly, in active supervised learning, given existing knowledge (a collection of labeled data points or examples), additional unlabeled (i.e., “inexpensive”) data points or examples may be employed to further train the decision rule. Queries made to the oracle to obtain a label should be minimized in terms of an associated cost (e.g., labels that need to be assigned by a human expert are considered more costly than those that may be assigned by execution of a simple computer algorithm). Labels for samples from the set U should be obtained only for those examples about which the active supervised learning method is most uncertain.
Approaches to measuring uncertainty may include entropy, distance to the decision rule, and uncertainty sampling. In uncertainty sampling, a label is obtained for a sample from the set U for which a classifier is least confident in its prediction. When a classifier renders a prediction, the classifier may also provide a value of the confidence it has in making this prediction, where the confidence is expressed as a value between −1 and 1. For example, if the confidence value is 1, then an image is confidently classified as a face. If it is −1, then the image is confidently classified as not a face. If the value is near 0, then the classifier is not confident in its prediction. Labels are thus obtained for those samples in the set U for which the confidence values are near 0.
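As a rough illustration of uncertainty sampling with a signed confidence value, the sketch below ranks pool samples by the absolute value of their confidence and selects those nearest 0. The `confidence` function (a tanh of a linear score with made-up weight and bias) and the pool values are assumptions introduced only for this example; any classifier that reports a value in [−1, 1] could take its place.

```python
import math
from typing import List, Sequence


def confidence(x: float, weight: float = 2.0, bias: float = -1.0) -> float:
    """Hypothetical confidence in [-1, 1]: near 1 means confidently 'face',
    near -1 means confidently 'not a face', near 0 means unsure."""
    return math.tanh(weight * x + bias)


def least_confident(pool: Sequence[float], k: int) -> List[float]:
    """Return the k pool samples whose confidence is closest to 0."""
    return sorted(pool, key=lambda x: abs(confidence(x)))[:k]


# Illustrative unlabeled pool (the set U); values are made up.
pool = [0.0, 0.2, 0.45, 0.6, 0.9, 1.5]
to_label = least_confident(pool, k=2)
```

Here the samples at 0.45 and 0.6 have confidence values of roughly −0.10 and 0.20, so they are the ones sent to the oracle, while the sample at 1.5 (confidence about 0.96) is left unlabeled.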
Another criterion for selecting samples from the set U is information-based sampling. It is desirable to select labels that are “informative”, i.e., labels that optimize expected gain in information about a data point while simultaneously minimizing the probability of error in labeling. For example, suppose the training data for a classifier comprises 100 images of a face, all of them alike, and suppose the 101st image is very similar to the first 100 images. The classifier would assign a high level of confidence that the 101st image is a face, so little additional information would be provided by obtaining a label for this 101st image. If the 101st image is instead, for example, a side view for which it is very difficult to decide whether it is an image of a face or not a face, then obtaining a label for this image would be much more relevant and informative to the classifier for making a correction to its prediction function. Thus, by querying the oracle only for informative labels, and thereby minimizing the number of queries, a decision rule is obtained from fewer labeled data points and in less time than with supervised learning.
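One simple proxy for how informative a label would be is the entropy of the classifier's predicted label distribution: it is near 0 bits when the classifier is already sure and reaches 1 bit when the two labels look equally likely. The sketch below applies this to the 101st-image example. It is an assumption-laden illustration: the predicted probabilities 0.99 and 0.5 and the candidate names are invented, and entropy is only one of several possible stand-ins for expected information gain.

```python
import math
from typing import Dict


def binary_entropy(p_face: float) -> float:
    """Entropy in bits of a two-class prediction with P(face) = p_face."""
    if p_face <= 0.0 or p_face >= 1.0:
        return 0.0  # a certain prediction carries no uncertainty
    q = 1.0 - p_face
    return -(p_face * math.log2(p_face) + q * math.log2(q))


# Hypothetical predicted probabilities for two candidate 101st images.
candidates: Dict[str, float] = {
    "near_duplicate_frontal": 0.99,  # looks like the first 100 images
    "side_view": 0.5,                # classifier cannot decide
}
scores = {name: binary_entropy(p) for name, p in candidates.items()}
most_informative = max(scores, key=scores.get)
```

Under these assumed probabilities, the side view scores a full bit of entropy while the near-duplicate scores under a tenth of a bit, which matches the intuition that labeling the side view teaches the classifier more.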
The active supervised learning paradigm described above applies to the classical supervised learning paradigm where one can query only the labels during the training stage. In reality, however, for many problems, much more information in addition to labels is available during the training stage. For instance, while labeling an email as spam or not-spam, a user can mark/highlight parts of the email that can be attributed to spam; this situation is very common. In this setting, the algorithm can query either the label, the additional information, or both during the training stage.