The advent of inexpensive cameras and inexpensive storage has made it practical to collect images and video for storage in very large databases. For example, it is estimated that one popular social media provider stores about 80 billion images, and processes 600,000 images per second.
The commercial viability of such databases depends in large part on the availability of search and retrieval applications. Thus, a great effort has been devoted to search and retrieval mechanisms for images. In general, such mechanisms rely on identifying points of interest in an image, often referred to as keypoints, and then extracting features from these points that remain accurate when subject to variations in translation, rotation, scaling and illumination.
Examples of such features include scale-invariant feature transform (SIFT), speeded-up robust features (SURF), binary robust invariant scalable keypoints (BRISK), fast retina keypoint (FREAK), histogram of oriented gradients (HoG), circular Fourier-HOG (CHOG), and others.
To reduce the bandwidth and complexity of such applications, while preserving matching accuracy and speed, the features are often aggregated and summarized into more compact descriptors. Approaches for compacting the feature spaces include principal component analysis (PCA), linear discriminant analysis (LDA), boosting, spectral hashing, and the popular Bag-of-Features approach. The latter converts features to compact descriptors (codewords) using cluster centers produced by k-means clustering.
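The Bag-of-Features idea described above can be illustrated with a minimal sketch. It assumes NumPy, random vectors standing in for real local features such as SIFT, and a plain Lloyd's k-means written out by hand; the function names `kmeans` and `bag_of_features` and all parameter values are hypothetical choices for illustration, not taken from any particular system.

```python
import numpy as np

def kmeans(features, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns a (k, d) array of cluster centers."""
    rng = np.random.default_rng(seed)
    # Initialize centers from k distinct training features.
    centers = features[rng.choice(len(features), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared Euclidean distance from every feature to every center: (n, k).
        d2 = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = features[labels == j]
            if len(members):  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers

def bag_of_features(features, centers):
    """Map each feature to its nearest codeword and return a normalized histogram."""
    d2 = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(centers)).astype(float)
    return hist / hist.sum()

# Synthetic stand-ins for local features (assumption: 16-dimensional vectors).
rng = np.random.default_rng(0)
training_features = rng.normal(size=(500, 16))
codebook = kmeans(training_features, k=8)

image_features = rng.normal(size=(40, 16))
descriptor = bag_of_features(image_features, codebook)
```

Here the 40 features of one image are reduced to a single 8-bin histogram, which is the kind of compaction the paragraph refers to. Practical codebooks are typically far larger than 8 codewords.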
The compact descriptors extracted from a query image or video can be compared to descriptors extracted from images in the database to determine similar images. There has, however, been much less work in developing efficient feature matching mechanisms for video queries.
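As a rough illustration of comparing a query descriptor against database descriptors, the sketch below ranks database entries by cosine similarity. The choice of cosine similarity, the function name `rank_by_similarity`, and the toy four-bin descriptors are assumptions made for this example; the text does not prescribe a particular distance measure.

```python
import numpy as np

def rank_by_similarity(query, database):
    """Return database indices sorted from most to least similar, with scores."""
    q = query / np.linalg.norm(query)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    scores = db @ q                 # cosine similarity of each row with the query
    order = np.argsort(-scores)     # descending similarity
    return order, scores[order]

# Toy compact descriptors (assumption: 4-bin histograms, one row per database image).
database = np.array([
    [0.7, 0.2, 0.1, 0.0],
    [0.1, 0.1, 0.4, 0.4],
    [0.6, 0.3, 0.1, 0.0],
])
query = np.array([0.6, 0.3, 0.1, 0.0])

order, scores = rank_by_similarity(query, database)
```

In this toy case the third database entry is identical to the query and is ranked first. A large database would normally use an index structure rather than this exhaustive comparison.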
Extending conventional image descriptors to derive video descriptors is not straightforward. One naïve method extracts image descriptors from each image in the video sequence, treating each image separately. That method fails to exploit the fact that features extracted from successive video images tend to be very similar, and describe similar keypoints, resulting in a very redundant representation. Furthermore, that method does not remove features that are not persistent from image to image, and probably does not describe the video sequence very well. Thus, simply collecting individual image descriptors is bandwidth-inefficient and significantly increases matching complexity.
A more efficient approach is to compress the descriptors derived from each video image, exploiting the motion of those descriptors through the video sequence. Such methods exploit powerful paradigms from video compression, such as motion-compensated prediction and rate-distortion optimization, to reduce the bit-rate of the transmitted descriptors. However, those methods do not address the problem of discovering a small set of descriptors that can represent a visually salient object.