Cameras are ubiquitous, with millions of them deployed by government and private entities at traffic intersections, enterprise offices, and retail stores. Video from at least some of these cameras is continuously recorded. One of the main purposes for recording the videos is answering “after-the-fact” queries. An after-the-fact query can include identifying video frames with objects of certain classes (e.g., cars or bags) over many days of recorded video. As results from these queries are used by analysts and investigators, achieving low query latencies, while maintaining query accuracy, can be advantageous.
Advances in convolutional neural networks (CNNs), backed by copious training data and hardware accelerators (e.g., GPUs), have led to high accuracy in computer vision tasks such as object detection and object classification. For example, the ResNet152 object classifier CNN won the ImageNet challenge, which evaluates classification accuracy on 1,000 classes using a public image dataset with labeled ground truths. For each image, these classifiers return a ranked list of 1,000 classes in decreasing order of confidence.
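As an illustration of the ranked output described above, the following minimal sketch orders per-class confidences from highest to lowest. The function name `rank_classes`, the four class labels, and the confidence values are hypothetical and chosen only for brevity; an actual classifier such as ResNet152 would produce scores for all 1,000 classes.

```python
def rank_classes(confidences: dict[str, float]) -> list[tuple[str, float]]:
    """Return (class, confidence) pairs in decreasing order of confidence."""
    return sorted(confidences.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical confidences for a single frame (a real classifier has 1,000 entries).
scores = {"truck": 0.18, "car": 0.71, "bag": 0.03, "bus": 0.08}

ranked = rank_classes(scores)
for label, confidence in ranked:
    print(f"{label}: {confidence:.2f}")
```

A query for frames containing cars could then check whether "car" appears near the top of this ranked list for each frame.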
Despite the accuracy of conventional image classifier CNNs (like ResNet152), using them for video analytics queries is both expensive and slow. Using the ResNet152 classifier at query time to identify video frames with cars in a month-long traffic video requires 280 GPU hours and costs a significant amount of money for the corresponding cloud computing resources. The latency for running queries is also high. Achieving a query latency of one minute on 280 GPU hours of work would involve tens of thousands of GPUs classifying the frames of the video in parallel, which is roughly two to three orders of magnitude more than what is typically provisioned (a few tens or hundreds) by traffic jurisdictions or retail stores.
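The arithmetic behind these figures can be sketched as follows. The sketch assumes that frame classification parallelizes perfectly across GPUs with no coordination overhead; the function names `gpus_needed` and `latency_minutes` are illustrative, and the 100-GPU figure is simply one point in the "few tens or hundreds" range mentioned above.

```python
import math


def gpus_needed(gpu_hours: float, target_latency_minutes: float) -> int:
    """GPUs required to finish the work within the target latency,
    assuming perfectly parallel frame classification."""
    gpu_minutes = gpu_hours * 60
    return math.ceil(gpu_minutes / target_latency_minutes)


def latency_minutes(gpu_hours: float, num_gpus: int) -> float:
    """Query latency in minutes when the work is spread evenly over num_gpus."""
    return gpu_hours * 60 / num_gpus


print(gpus_needed(280, 1))        # 16800 GPUs for a one-minute query
print(latency_minutes(280, 100))  # 168.0 minutes (2.8 hours) on 100 GPUs
```

Under these assumptions, a one-minute query needs 16,800 GPUs, while a deployment with 100 GPUs would take close to three hours for the same query.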