1. Technical Field
The present disclosure relates to data labeling and more specifically to crowd-sourced data labeling.
2. Introduction
Labeled data is vital for training statistical models. For instance, labeled data is used to train automatic speech recognition engines, text-to-speech engines, machine translation systems, internet search engines, video analysis algorithms, and so forth. In all these applications, increasing the amount of labeled data generally yields better performance. Thus, gathering large amounts of labeled data is extremely important to advancing performance in a wide range of technologies.
Traditional approaches to labeling data rely on hiring and training experts. Here, each data instance is examined and labeled by an expert. Sometimes, each data instance is also checked by another expert. Disadvantageously, the traditional process of labeling data with experts is expensive and slow: hiring and training experts can be very costly, and experts require many hours of work to label even a comparatively small number of instances. This approach also scales poorly. For example, experts cannot be added and discharged swiftly, which makes it difficult to label a burst of data rapidly. Moreover, it is often hard to find enough experts for large labeling projects, particularly when the volume of work fluctuates.
Recently, crowd-sourcing has emerged as a faster and cheaper approach to labeling data, enabled by platforms such as Amazon's Mechanical Turk. In crowd-sourcing, a large task is divided into smaller tasks. The smaller tasks are then distributed to a large pool of crowd workers, typically through a website. The crowd workers complete the smaller tasks for very small payments, resulting in substantially lower overall costs. Further, the crowd workers work concurrently, greatly speeding up the completion of the original large task.
Despite the speed improvements and lower costs, crowd-sourcing is limited in several ways. For example, individual crowd workers are often inaccurate and generally produce lower-quality labels. Requesting a greater, fixed number of labels for each data instance can improve overall accuracy, but in practice many of these labels are not needed, resulting in wasted expense. Automatic labelers are sometimes combined with crowd-sourcing to increase accuracy. However, current implementations are open to cheating by crowd workers: the output from the automatic labelers is given to the crowd workers as a suggested label, and because the workers are paid by the task, they have an obvious incentive to make as few edits as possible. These and other challenges remain as significant obstacles to improving a wide range of technologies that rely on labeled data.
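The wasted expense of a fixed label count, noted above, can be illustrated with a minimal sketch. It assumes labels are combined by a simple plurality vote, which the disclosure does not specify; the function name `majority_with_early_stop` and the example labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def majority_with_early_stop(labels):
    """Return (winning_label, labels_used) for a fixed-size batch of labels.

    labels_used is the number of labels, in arrival order, after which the
    plurality winner could no longer change given the batch size.
    Plurality voting is an assumption made for this sketch.
    """
    if not labels:
        raise ValueError("at least one label is required")
    total = len(labels)
    counts = Counter()
    for used, label in enumerate(labels, start=1):
        counts[label] += 1
        ranked = counts.most_common(2)
        leader, leader_votes = ranked[0]
        runner_up_votes = ranked[1][1] if len(ranked) > 1 else 0
        remaining = total - used
        # The leader is safe once no other label can catch up.
        if leader_votes - runner_up_votes > remaining:
            return leader, used
    # The outcome stayed open (or tied) until the last label arrived.
    return counts.most_common(1)[0][0], total


# Hypothetical batch: five labels requested for one data instance.
batch = ["cat", "cat", "cat", "dog", "cat"]
winner, used = majority_with_early_stop(batch)
print(winner, used, len(batch) - used)  # cat 3 2
```

In this hypothetical batch the outcome is settled after three labels, so the remaining two paid labels add cost without changing the result.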