Detection and understanding of objects, actions, events, and segments in digital videos are significant problems in, and desired functionalities for, computer vision. For example, video understanding may be needed to detect certain objects, actions, events, and segments in television broadcasts, motion pictures, videos shared via social media, and surveillance videos. Such detection may even improve an X-ray or magnetic resonance imaging (MRI) scan feature, or be used to provide new computer vision services for a cloud services provider. Additionally, some tasks, such as object detection and action recognition, may potentially be run on resource-constrained devices such as smartphones, tablets, or even Internet of Things (IoT) devices such as security cameras.
As video files tend to be sizeable, heavily compressed files are favored. However, most modern vision algorithms require decompressing the video and processing the uncompressed video frame-by-frame, which is expensive in terms of computation, memory, and storage. By way of example, Faster R-CNN (Region-based Convolutional Neural Network), an object detection algorithm well known in the art, runs at 5 FPS (frames per second) on 720p video in one case (throughput will vary as a function of processor speed), so vision processing of one hour of video via that algorithm requires approximately five hours of computing time. This makes it impractical, or at least computationally expensive, to run such high-accuracy algorithms on devices such as smartphones and small cameras (e.g., security cameras), or to perform real-time vision tasks on such storage- and computationally-constrained devices. Because video frames are largely redundant in large part, other approaches sample the frames at intervals and skip the rest, but such approaches remain relatively slow and can introduce errors in the form of missed objects.
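The cost arithmetic above, and the proportional (but lossy) savings of stride-based frame sampling, can be sketched as follows. The 24 fps source rate and the stride of 8 are illustrative assumptions, not figures from the text; the 5 FPS detector throughput matches the example given:

```python
def processing_hours(video_hours, source_fps, detector_fps, stride=1):
    """Hours of compute needed to run a detector over a video,
    processing every `stride`-th decoded frame."""
    total_frames = video_hours * 3600 * source_fps
    sampled_frames = total_frames / stride
    return sampled_frames / detector_fps / 3600

# Dense processing: one hour of (assumed) 24 fps video through a
# detector that handles 5 frames per second.
dense = processing_hours(1, 24, 5)             # ~4.8 hours of compute

# Sampling every 8th frame cuts compute 8x, but any object that
# appears and disappears between sampled frames is missed entirely.
sampled = processing_hours(1, 24, 5, stride=8)  # ~0.6 hours of compute
```

Under these assumptions the dense pass costs roughly five hours of compute per hour of video, consistent with the example above, while sampling trades accuracy (missed objects) for a linear reduction in cost.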