1. Technical Field
The disclosed embodiments relate to a system and methods for active learning to train classifiers in multiple categories, and more particularly, to efficiently train classifiers in multiple categories by requiring far fewer editorially-labeled examples from large datasets, and to test the trained classifiers on unlabeled data sets with the same methods.
2. Related Art
The rapid growth and ever-changing nature of web content demands automated methods of managing it. One such methodology is categorization, in which documents and other types of content are automatically placed into nodes of a human-induced taxonomy. A taxonomy is a hierarchy of categories; taxonomies defined for the web are typically large, often involving thousands of categories. Maintaining the relevance of the classifiers trained on such taxonomies over time, and placing new types of content such as ads, videos, forum posts, products, feeds and other data “examples” into a pre-defined taxonomy, require the availability of a large amount of labeled data. Because the content of the web is ever-growing, classifiers must be continually updated with newly labeled examples.
Labeling data is an expensive task, especially when the categorization problem is multiclass in nature and the available editorial resources have to be used efficiently. Editorial resources refer to human editors who manually review an example to label it. Active learning is a well-studied methodology that attempts to maximize the efficiency of the labeling process in such scenarios. Active learning typically proceeds by first training an initial model on a small labeled dataset. Given a large pool of unlabeled examples, it then selects an unlabeled example that it deems “informative,” that is, one expected to improve classification performance the most if its label is revealed. The example is then labeled by human editors and added to the training set. This procedure is repeated iteratively until performance converges or, more realistically, for as long as labeling resources are available. In practice, to limit the turnaround cycle, active learning selects not just one but a batch of informative examples to be labeled during each iteration.
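The iterative procedure described above can be sketched in a few lines of Python. This is a minimal illustration only, not the method of the disclosed embodiments: the nearest-centroid model, the margin-based notion of informativeness, and the names `train_centroids`, `margin`, `active_learning`, `oracle` and `demo` are all assumptions introduced for the example. The `oracle` function stands in for the human editors, and the seed set is assumed to contain at least two categories.

```python
import random

def train_centroids(labeled):
    """Fit a toy nearest-centroid model: one mean point per category."""
    sums, counts = {}, {}
    for (x, y), label in labeled:
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + x, sy + y)
        counts[label] = counts.get(label, 0) + 1
    return {c: (sums[c][0] / counts[c], sums[c][1] / counts[c]) for c in sums}

def margin(model, point):
    """Gap between the two smallest squared distances to the centroids.
    A small gap means the model is unsure between its top two categories."""
    d = sorted((point[0] - cx) ** 2 + (point[1] - cy) ** 2
               for cx, cy in model.values())
    return d[1] - d[0]

def active_learning(labeled, pool, oracle, batch_size, iterations):
    """Batch-mode loop: train, pick the least certain examples, label, repeat."""
    labeled, pool = list(labeled), list(pool)
    for _ in range(iterations):
        if not pool:
            break  # nothing left to label
        model = train_centroids(labeled)
        # Assumed informativeness measure: smallest margin first.
        pool.sort(key=lambda p: margin(model, p))
        batch, pool = pool[:batch_size], pool[batch_size:]
        # The oracle plays the role of the editors revealing labels.
        labeled.extend((p, oracle(p)) for p in batch)
    return train_centroids(labeled), labeled, pool

def demo():
    rng = random.Random(0)
    # Hypothetical categories and their "true" centers, for illustration only.
    centers = {"news": (0.0, 0.0), "sports": (4.0, 0.0), "finance": (2.0, 4.0)}

    def oracle(p):
        return min(centers, key=lambda c: (p[0] - centers[c][0]) ** 2
                                          + (p[1] - centers[c][1]) ** 2)

    pool = [(rng.uniform(-2, 6), rng.uniform(-2, 6)) for _ in range(200)]
    seed = [(centers[c], c) for c in centers]  # one labeled example per category
    return active_learning(seed, pool, oracle, batch_size=5, iterations=4)
```

Calling `demo()` runs four iterations with a batch of five, so the editors (here, the oracle) are asked for twenty labels in total, regardless of how many categories the taxonomy has. The stopping rule in this sketch is a fixed iteration count, which corresponds to the "while labeling resources are available" case rather than to a convergence test.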
Existing active learning approaches differ in the technique used to define the informativeness of a data point or example. Some solutions focus exclusively on binary categorization problems, some are restricted to specific types of classifiers, and others require a number of extra classifiers to be trained for each classification task. These approaches become infeasible, however, when dealing with a large, multiclass categorization problem such as those that abound in real-world web content.
One straightforward multiclass active-learning strategy is to apply binary active learning techniques to the decomposed binary subproblems and select the most informative examples independently for each binary classifier. Such strategies are examples of local active learning methods, which have been criticized for their inability to scale to real-world problems. For instance, if a taxonomy contains a thousand nodes, choosing a single most-informative example per binary subproblem would yield a thousand examples to be labeled during each iteration of the multiclass problem. Performing just twenty iterations, as in the experiments disclosed herein, would require editorial resources for labeling twenty thousand examples, which would be infeasible in most real-world applications.
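The scaling problem of local methods can be made concrete with a short Python sketch. It is illustrative only: the function names `local_selection` and `labeling_budget` are hypothetical, and ranking examples by the absolute value of a one-vs-rest decision score is one common uncertainty heuristic assumed here, not a technique stated in this disclosure. The budget computed is a worst case in which no two binary classifiers pick the same example.

```python
def local_selection(scores, per_classifier=1):
    """Pick examples independently for each binary (one-vs-rest) classifier.

    scores maps category -> {example_id: signed decision score}.
    Assumed heuristic: the closer a score is to zero, the more informative
    the example is for that category's classifier.
    """
    picks = {}
    for category, by_example in scores.items():
        ranked = sorted(by_example, key=lambda e: abs(by_example[e]))
        picks[category] = ranked[:per_classifier]
    return picks

def labeling_budget(num_categories, per_classifier, iterations):
    """Worst-case number of editorial labels requested by a local method."""
    return num_categories * per_classifier * iterations

# Toy scores for two categories and three documents (made-up values).
toy_scores = {
    "autos":  {"d1": 0.9, "d2": -0.1, "d3": 0.4},
    "travel": {"d1": -0.2, "d2": 0.8, "d3": -0.7},
}
print(local_selection(toy_scores))     # {'autos': ['d2'], 'travel': ['d1']}
print(labeling_budget(1000, 1, 20))    # 20000
```

Because each category selects on its own, the number of labels requested per iteration grows linearly with the size of the taxonomy, which is the figure of twenty thousand labels reached in the example above.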