The exemplary embodiment relates to active learning. It finds particular application in connection with a system and method for generating a visual representation of uncertainty associated with the labeling of elements for assisting a human annotator in selecting the next element to label.
The goal of active learning is to identify patterns based on a limited amount of data. The technique is currently used in machine learning tasks, such as classification, when the work of manually labeling data is too costly. In the active learning stage, a human annotator chooses an appropriate discrete class from a set of classes available for labeling an element and labels the element accordingly. Based on the labeling of a group of such elements, a model is progressively learned which allows new elements to be labeled automatically by the model, based on their extracted features. Active learning finds application in a variety of fields. One example is the classification of documents according to content-based classes (such as “sports,” “politics,” “business,” “science,” etc., in the case of news articles). Here, the elements to be labeled are the documents themselves and the features used by the model may be words, phrases, or the like which occur within the document. Another application is the labeling of parts of a document, such as labeling the title, author, etc. Here the features may be related to document structure, font size, position on the page, etc. Yet another application is in the labeling of images according to visual classes based on the visual content, where the elements to be labeled are the images and the features may be extracted from patches of the image using image processing techniques.
Several approaches have been developed to make the active learning process more efficient. In some approaches, a user annotates an element that is proposed by an algorithm to improve the model quality. Since the manual annotation is often costly, a goal of these active learning approaches is to reduce the number of elements to annotate by an iterative process in which the model is updated as new elements are labeled. At each iteration, an algorithm aims to propose that the annotator labels the element which has the maximum benefit for the classifier. These approaches can allow a significant reduction in the training set required to build a relevant model which is then able to label the remaining unannotated dataset. There are, however, several drawbacks to this framework. First, the user has no relevant information about the quality of the current model. Although some metric based on uncertainty of unannotated element prediction may give a general idea of model quality, this is generally insufficient to provide the annotator with enough information to be able to make meaningful decisions for when to allow automatic labeling. In particular, the annotator has no knowledge of where the uncertainty in the model lies and how many and what kind of elements remain to be annotated.
Second, traditional active learning methods do not permit the annotator to select elements for labeling. Rather, the next element to label is chosen by the machine and the user's only responsibility is to associate a class with the proposed element. Where the classification is a two-class problem, a machine may be programmed to identify suitable elements for labeling which will improve the model. For multi-class problems, however, the complexity of identifying elements for labeling rapidly increases with the number of classes. In practice, no active learning algorithm is optimal for all datasets.
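By way of illustration only, the iterative process described above, in which the machine proposes the next element and the model is updated after each new label, may be sketched as follows. The nearest-centroid model, the margin-based uncertainty score, and all function names (train, margin, predict, active_learning) are assumptions chosen to keep the sketch short; they are not taken from any of the approaches discussed here.

```python
def train(labeled):
    """Hypothetical nearest-centroid model: mean feature value per class."""
    sums = {}
    for x, y in labeled:
        total, count = sums.get(y, (0.0, 0))
        sums[y] = (total + x, count + 1)
    return {y: total / count for y, (total, count) in sums.items()}


def margin(model, x):
    """Gap between the two nearest class centroids; a small gap means uncertain."""
    distances = sorted(abs(x - centroid) for centroid in model.values())
    return distances[1] - distances[0] if len(distances) > 1 else 0.0


def predict(model, x):
    """Label x with the class whose centroid is closest."""
    return min(model, key=lambda y: abs(x - model[y]))


def active_learning(pool, oracle, seed, budget):
    """Machine-driven loop: the algorithm, not the annotator, picks each query."""
    labeled = [(x, oracle(x)) for x in seed]
    unlabeled = [x for x in pool if x not in seed]
    for _ in range(budget):
        if not unlabeled:
            break
        model = train(labeled)                      # update the model
        query = min(unlabeled, key=lambda x: margin(model, x))
        unlabeled.remove(query)
        labeled.append((query, oracle(query)))      # annotator supplies the class
    return train(labeled), labeled, unlabeled


# Toy one-dimensional pool; the oracle stands in for the human annotator.
pool = [float(v) for v in range(10)]
oracle = lambda x: "low" if x < 5 else "high"
model, labeled, unlabeled = active_learning(pool, oracle, seed=[0.0, 9.0], budget=3)
remaining_predictions = {x: predict(model, x) for x in unlabeled}
```

In this sketch the annotator's only role is to answer the oracle call, which mirrors the second drawback noted above: the loop gives the annotator no view of where the model remains uncertain and no say in which element is labeled next.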
In one approach, referred to as the Uncertainty Based Sampling method, the aim is to label, at each iteration, the least certain element according to the current classifier. (See, LEWIS, D., AND GALE, W. A sequential algorithm for training text classifiers. In Proc. Int'l ACM-SIGIR Conf. on Research and Development in Information Retrieval (1994)). Another approach, Query by Committee, chooses the element which maximizes disagreement between several classifiers (See, SEUNG, H. S., OPPER, M., AND SOMPOLINSKY, H. Query by committee. In Proc. 5th Annual ACM Workshop on Computational Learning Theory (1992), pp. 287-294). The Error rate reducing method tries to select an element that, once added to the training set, minimizes the generalization error (See, ROY, N., AND MCCALLUM, A. Toward optimal active learning through sampling estimation of error reduction. In Proc. 18th Int'l Conf. on Machine Learning (ICML) (2001), pp. 441-448). Other approaches combine several active learning algorithms. (See, for example, OSUGI, T., KUN, D., AND SCOTT, S. Balancing exploration and exploitation: A new algorithm for active machine learning. In Proc. 5th Int'l Conf. on Data Mining (ICDM) (2005), pp. 330-337).
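As a rough illustration of the first two selection criteria, the following sketch picks the least certain element from a list of predicted class probabilities, and the element of maximum committee disagreement from a list of committee votes. The least-confidence score and the vote-entropy disagreement measure are common choices assumed here for illustration; the cited works may define these quantities differently, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter


def least_confident(probabilities):
    """Index of the element whose most probable class has the lowest probability."""
    return min(range(len(probabilities)), key=lambda i: max(probabilities[i]))


def vote_entropy(votes):
    """Entropy of the labels voted by the committee for a single element."""
    total = len(votes)
    return -sum((c / total) * math.log(c / total) for c in Counter(votes).values())


def max_disagreement(committee_votes):
    """Index of the element on which the committee members disagree the most."""
    return max(range(len(committee_votes)),
               key=lambda i: vote_entropy(committee_votes[i]))


# Toy data: three unlabeled elements, three classes, a committee of three classifiers.
class_probabilities = [
    [0.90, 0.05, 0.05],   # confident prediction
    [0.40, 0.35, 0.25],   # least certain prediction
    [0.60, 0.30, 0.10],
]
committee_votes = [
    ["sports", "sports", "sports"],       # full agreement
    ["sports", "politics", "business"],   # full disagreement
    ["sports", "sports", "politics"],
]

uncertainty_choice = least_confident(class_probabilities)
committee_choice = max_disagreement(committee_votes)
```

Both criteria reduce to ranking the unlabeled elements by a single score, so in either case the annotator is presented with one proposed element rather than with an overview of the model's uncertainty.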
When the dataset is large, or when the classifier used is computationally complex, learning or inference may be computationally expensive. In such cases, the user annotates several elements at each iteration. However, in such a process, there is a tendency for the system to propose similar elements for labeling. Pre-Clustering methods aim to reduce this redundancy (See, NGUYEN, H., AND SMEULDERS, A. Active learning using pre-clustering. In Proc. 21st Int'l Conf. on Machine Learning (ICML) (2004), pp. 79-86). However, this approach complicates the active learning process, and ensuring a good clustering is difficult.
Sometimes, the cost associated with the annotation differs according to the element which is to be labeled. Active learning does not take this annotation cost into account.
In traditional approaches, the annotator continues the active learning stage until the model is believed to be good enough to be applied automatically to non-annotated data. However, the annotator may not have a sufficiently reliable understanding of model quality to make such a decision. There are some active learning systems that involve the user in the decision process. In one approach, a user can switch between two modes: either to annotate the least confident unlabeled data (as in Uncertainty Based Sampling) or to annotate the most confident set. (See, CHIDLOVSKII, B., FUSELIER, J., AND LECERF, L. Aldai: active learning documents annotation interface. In ACM Symp. on Document Engineering (2006), pp. 184-185). A plot showing the evolution of model confidence helps the user to make a good tradeoff between the two annotation modes. Such approaches, however, give no relevant information about the model uncertainty that would permit the annotator to choose the next element to annotate.
Another kind of interactive approach is semi-supervised visual clustering (See, CHUNG, K. F.-L., WANG, S., DENG, Z., SHU, C., AND HU, D. Clustering analysis of gene expression data based on semi-supervised visual clustering algorithm. Soft Comput. 10, 11 (2006), pp. 981-993; and CHIDLOVSKII, B., AND LECERF, L. Semi-supervised visual clustering for spherical coordinates systems. In 23rd Annual ACM Symp. on Applied Computing (2008)). Here, the user annotates unlabeled data, helped by an interactive visual clustering system. The visualization is useful for understanding the structure of the data but does not permit the annotator to visualize the model quality on the data for the annotation task. The aim is to cluster the data with the help of labeled elements, not to find the minimal training set required to build a relevant model.
The exemplary embodiment provides a system and method for providing an annotator with information on the current model uncertainty so that he will be able to make an intelligent choice regarding the next element to label.