Modern computer vision systems can be applied to digital content for purposes of performing visual recognition tasks including, for example, face recognition, medical imaging, scene understanding for self-driving, and the like. The major bottleneck in applying such systems is the need for large-scale annotated datasets. Such systems must be trained with millions of annotated examples in order to function properly for a given task. Training and deploying computer vision into products today requires a significant amount of effort into annotating datasets (e.g., both by human and machine alike), thereby reducing the speed in which such systems are trained and ready for implementation, and drastically delaying the time-to-market.
In today's world, video understanding is one of the most important areas of research and development across the media industry. Unfortunately, compared to image datasets, video datasets are notoriously difficult to annotate due to the sheer number of frames that need to be inspected by labelers. For example, the task of drawing bounding boxes around certain objects for every single frame of a video requires significant amounts of time and effort, high utilization of computer and network resources, and is not necessarily always accurate.
This, therefore, has motivated several prior works to use a workaround solution where a visual recognizer is trained on an image dataset and then applied to the video domain. However, this does not perform well in practice since video frames do not manifest the same visual characteristics as images. This is because of the way they are captured and encoded. Videos capture dynamic moving objects while images capture static objects, and the location of the objects within/across video frames can change while the location within an image remains the same. Additionally, most video codecs apply compression algorithms to make the file smaller, which can result in blurry frames, and image files are typically not subject to compression. Therefore, in order to apply such conventional techniques, visual recognizers trained on image datasets must be readjusted for each video frame, which further limits the speed and accuracy upon which the systems can be built and applied.