Conventional vision detection approaches have shown that with a large amount of annotated data, convolutional neural networks (CNNs) can be trained to achieve a super-human performance for various visual recognition tasks. However, these conventional vision detection methods have failed to investigate into effective approaches for data annotation, since data annotation is essential and expensive. For example, data annotation is especially expensive for object detection tasks. Compared to annotating image classes, which can be done via a multiple-choice question, annotating object location requires a human annotator to specify a bounding box for an object. Simply dragging a tight bounding box to enclose an object can cost 10-times more time than answering a multiple-choice question. Consequently, a higher pay rate has to be paid to a human labeler for annotating images for an object detection task. In addition to the cost, which is more difficult to monitor and control is the annotation quality.
Accordingly, there is need to achieve better performance with less annotation processes and, hence, less annotation budgets, among other things.