Field
The described embodiments relate to techniques for selecting subsets of comments for annotation. More specifically, the described embodiments relate to techniques for selecting subsets of comments for annotation based on an annotation probability distribution that specifies the annotation bias of annotations provided by reviewers.
Related Art
Online crowdsourcing platforms are an increasingly popular way to leverage Internet users across the world as a scalable means of annotating datasets for various machine learning tasks. Although these crowdsourcing platforms are less expensive than employing and training expert annotators, crowdsourcing can still be expensive because building high-performance classifiers often requires large sets of annotated data with multiple annotations for each data item.
One approach for addressing this problem is active learning, in which a particular unlabeled data instance is selected for labeling in an attempt to improve the classifier performance. However, traditional active-learning techniques often assume reliable annotators. This assumption is usually not valid with crowdsourcing. In addition to the annotation bias of each individual annotator, there can be interference between data items that are presented simultaneously for annotation through crowdsourcing. For example, there are often situations in which batches of multiple data items are judged by crowds at the same time. In particular, when evaluating the results of a search engine for a given query, the retrieved web pages are usually judged by crowds (either by explicit labeling or by implicit click-through rate) in batches. Other examples include object recognition and clustering. In general, batch active learning may be particularly vulnerable to such interference because multiple data items are submitted simultaneously for annotation, both to reduce annotation costs and to minimize classifier retraining cycles. The resulting annotation bias can degrade the quality of services based on the annotated data, which can be frustrating to users of these services.
Note that like reference numerals refer to corresponding parts throughout the drawings. Moreover, multiple instances of the same part are designated by a common prefix separated from an instance number by a dash.