Systems and methods herein generally relate to labeling data instances from data streams, and more particularly to supervised learning that uses trained machine classifiers.
Modern computerized systems automatically classify extremely large volumes of data quickly and efficiently using classification rules contained within items that are sometimes referred to as “models.” Such models need to be trained to ensure that they are properly classifying the incoming data streams. Such training often involves selecting instances from the data stream and presenting such instances to a human operator for annotation or classification. This process is sometimes referred to as supervised learning.
In supervised learning, it is most helpful to have the human operator annotate data instances for which the current model has a low classification confidence. Data instances that the current model finds difficult to classify (i.e., those having a low classification confidence) are the most useful data instances to obtain human input on, because they can provide the greatest incremental increase in classification accuracy for the classification model. In view of this, the elements that select data instances to be annotated by humans generally select data instances having a classification confidence that is below a classification confidence threshold.
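By way of illustration only, the threshold-based selection described above can be sketched as follows. The sketch assumes that the classification confidence of a data instance is the largest class probability the current model assigns to it; the function name `select_for_annotation`, the parameter names, and the example probabilities are hypothetical and are not taken from any particular system.

```python
from typing import List, Sequence


def select_for_annotation(
    class_probabilities: Sequence[Sequence[float]],
    confidence_threshold: float,
) -> List[int]:
    """Return indices of instances whose confidence is below the threshold.

    Assumption: confidence is the highest class probability for the instance.
    """
    selected = []
    for index, probabilities in enumerate(class_probabilities):
        confidence = max(probabilities)
        if confidence < confidence_threshold:
            selected.append(index)
    return selected


# Hypothetical two-class probabilities for four data instances.
example_probabilities = [
    [0.95, 0.05],  # confident, not selected
    [0.55, 0.45],  # low confidence, selected
    [0.20, 0.80],  # confident, not selected
    [0.51, 0.49],  # low confidence, selected
]
to_annotate = select_for_annotation(example_probabilities, confidence_threshold=0.6)
print(to_annotate)  # [1, 3]
```

In this sketch only the second and fourth instances would be presented to the human operator, because only their confidence falls below the assumed threshold of 0.6.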
Supervised learning trains machine classifiers on hundreds or thousands of labeled instances. For example, supervised learning can be used for sentiment analysis of streaming data from Twitter® products (Twitter, Inc., 1355 Market Street, Suite 900, San Francisco, Calif. 94103 USA), where the Tweets® (data instances within the data stream) to be classified or labeled are selected by human annotators using keywords or data ranges. Because such labeling of data instances within the data streams is often performed by human experts, the labeled instances are in many cases difficult, time-consuming, and/or expensive to obtain. The idea of active learning is for the model to achieve high accuracy with as few manually labeled instances as possible, thereby reducing the labeling cost. In general, active learning involves actively selecting instances for labeling from the available unlabeled data based on a well-defined strategy, as opposed to selecting instances at random.
There are a number of different strategies used for active learning, and most can be categorized into two approaches: 1) pool-based methods that select instances from an available pool of unlabeled instances and 2) stream-based methods that select samples from an incoming stream of unlabeled instances.
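By way of illustration only, the difference between the two approaches can be sketched as follows. A pool-based method can rank every instance in the pool before choosing, while a stream-based method must decide on each instance as it arrives. The sketch assumes a least-confidence ranking for the pool and a fixed threshold for the stream; the names `pool_based_select`, `stream_based_select`, `budget`, and `threshold`, as well as the confidence scores, are hypothetical.

```python
from typing import Callable, Iterable, Iterator, List, Sequence

# Hypothetical stand-in for a model: maps an instance to a confidence in [0, 1].
ConfidenceFn = Callable[[str], float]


def pool_based_select(
    pool: Sequence[str], confidence: ConfidenceFn, budget: int
) -> List[str]:
    """Rank the whole pool and return the `budget` least confident instances."""
    return sorted(pool, key=confidence)[:budget]


def stream_based_select(
    stream: Iterable[str], confidence: ConfidenceFn, threshold: float
) -> Iterator[str]:
    """Decide for each instance, as it arrives, whether to request a label."""
    for instance in stream:
        if confidence(instance) < threshold:
            yield instance


# Hypothetical confidence scores standing in for a trained classifier.
scores = {"a": 0.90, "b": 0.40, "c": 0.55, "d": 0.70}


def lookup_confidence(instance: str) -> float:
    return scores[instance]


from_pool = pool_based_select(list(scores), lookup_confidence, budget=2)
from_stream = list(stream_based_select(iter(scores), lookup_confidence, threshold=0.5))
print(from_pool)    # ['b', 'c']
print(from_stream)  # ['b']
```

The pool-based sketch always returns a fixed number of instances because it can compare all candidates at once, whereas the number returned by the stream-based sketch depends on how many arriving instances fall below the assumed threshold.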