The advent of inexpensive cameras and inexpensive storage has made it practical to store images and video in very large databases. For example, it is estimated that one popular social media provider stores about 80 billion images, and processes 600,000 images per second.
The commercial viability of such databases depends in large part on the availability of search and retrieval applications. Thus, great effort has been devoted to search and retrieval mechanisms for images and videos. In general, such mechanisms rely on identifying points of interest in an image, often referred to as keypoints, and then extracting features from these keypoints that remain accurate when subject to variations in translation, rotation, scaling and illumination.
Examples of such features include scale-invariant feature transform (SIFT), speeded-up robust features (SURF), binary robust invariant scalable keypoints (BRISK), fast retina keypoint (FREAK), histogram of oriented gradients (HoG), circular Fourier-HOG (CHOG), and others.
To reduce the bandwidth and complexity of such applications, while preserving matching accuracy and speed, the features are often aggregated and summarized into more compact descriptors. Approaches for compacting the feature spaces include principal component analysis (PCA), linear discriminant analysis (LDA), boosting, spectral hashing, and the Bag-of-Features (BoF) approach. The BoF approach converts features to compact descriptors (codewords) using cluster centers produced by k-means clustering.
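The BoF conversion described above can be illustrated with a minimal sketch. It assumes features are short lists of floats and uses a plain k-means loop written for illustration; the function names (`kmeans`, `bof_histogram`), the iteration count, and the toy data are hypothetical and are not taken from any standard or library.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_center(feature, centers):
    """Index of the cluster center (codeword) closest to the feature."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(feature, centers[i]))

def kmeans(features, k, iterations=20, seed=0):
    """Plain k-means; returns k cluster centers used as the codebook."""
    rng = random.Random(seed)
    centers = [list(f) for f in rng.sample(features, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for f in features:
            clusters[nearest_center(f, centers)].append(f)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous center
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return centers

def bof_histogram(features, centers):
    """Normalized count of features assigned to each codeword."""
    hist = [0] * len(centers)
    for f in features:
        hist[nearest_center(f, centers)] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else hist

# Toy 2-D "features" forming two well-separated groups (illustrative only).
features = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
centers = kmeans(features, k=2)
hist = bof_histogram(features, centers)
print(hist)
```

The histogram has one entry per codeword regardless of how many features the image produced, which is what makes the resulting descriptor compact.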
The compact descriptors extracted from a query image or video can be compared to descriptors extracted from images in the database to determine similar images. The International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC) have published a standard, Compact Descriptors for Visual Search (CDVS), that is designed to address the challenges of searching image datasets. MPEG CDVS provides standardized descriptors for efficient still image search applications. The major steps in the pipeline include:
1. Take an image as input;
2. Detect keypoints in the image;
3. Extract and aggregate local descriptors for the keypoints;
4. Generate a global descriptor, scalable compressed Fisher vector (SCFV);
5. Compress the global descriptor;
6. Code the coordinates of the selected keypoints;
7. Compress the selected local descriptors; and
8. Output the compacted descriptor bitstream.
Extending conventional image descriptors to derive video descriptors is not straightforward. One naive method extracts image descriptors from each image in the video sequence, treating each image separately. That method fails to exploit the fact that features extracted from successive video images tend to be very similar, and describe similar keypoints, resulting in a very redundant representation. Furthermore, that method does not remove features that are not persistent from image to image, and therefore probably does not describe the video sequence very well. Thus, simply collecting individual image descriptors is bandwidth-inefficient and significantly increases matching complexity.
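The redundancy of per-image descriptors can be made concrete with a small sketch. It assumes binary descriptors (in the style of BRISK or FREAK) stored as Python integers and compared by Hamming distance; the function name `split_redundant`, the threshold `max_distance`, and the 8-bit toy descriptors are hypothetical choices for illustration.

```python
def hamming(a, b):
    """Number of differing bits between two binary descriptors."""
    return bin(a ^ b).count("1")

def split_redundant(prev_descriptors, curr_descriptors, max_distance=2):
    """Split the current image's descriptors into those that nearly
    repeat a descriptor of the previous image and those that are new."""
    redundant, novel = [], []
    for d in curr_descriptors:
        if any(hamming(d, p) <= max_distance for p in prev_descriptors):
            redundant.append(d)
        else:
            novel.append(d)
    return redundant, novel

# Toy 8-bit descriptors from two successive images (illustrative only).
frame_a = [0b11110000, 0b00111100]
frame_b = [0b11110001, 0b00111100, 0b00000011]

redundant, novel = split_redundant(frame_a, frame_b)
print(len(redundant), "redundant,", len(novel), "novel")
```

In this toy case two of the three descriptors in the second image nearly repeat descriptors of the first image, so storing both images' descriptors independently duplicates most of the information.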
Another approach is to compress the descriptors derived from each video image, exploiting interframe predictions in the video sequence. Those methods exploit an affine transformation to predict feature descriptors and keypoint locations. However, those methods are limited in several aspects:
1) They do not address the problem of discovering a small set of descriptors that can represent a visually salient object.
2) Such approaches generally do not provide keypoint trajectories on the video signal.
3) The affine transformation used to code the keypoint locations in subsequent pictures involves high extraction complexity and is likely to suffer when accurate motion details are of concern. Moreover, it is very difficult to use a single motion model to cover all types of motion in a practical application.
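The third limitation can be illustrated with a minimal sketch of affine prediction of keypoint locations. It assumes a six-parameter affine model (a, b, c, d, tx, ty) that is already known; estimating the parameters is omitted, and the function names and toy coordinates are hypothetical.

```python
def apply_affine(params, point):
    """Map (x, y) to (a*x + b*y + tx, c*x + d*y + ty)."""
    a, b, c, d, tx, ty = params
    x, y = point
    return (a * x + b * y + tx, c * x + d * y + ty)

def prediction_residuals(params, prev_points, curr_points):
    """Difference between the actual keypoint locations in the current
    picture and the locations predicted from the previous picture."""
    residuals = []
    for p, q in zip(prev_points, curr_points):
        px, py = apply_affine(params, p)
        residuals.append((q[0] - px, q[1] - py))
    return residuals

# A single global model: pure translation by (2, -1) (illustrative only).
params = (1.0, 0.0, 0.0, 1.0, 2.0, -1.0)
prev_points = [(10.0, 20.0), (30.0, 40.0), (50.0, 60.0)]
# The first two keypoints follow the global motion; the third moves on its own.
curr_points = [(12.0, 19.0), (32.0, 39.0), (45.0, 66.0)]

residuals = prediction_residuals(params, prev_points, curr_points)
print(residuals)
```

The keypoints that follow the global motion are predicted exactly, while the independently moving keypoint leaves a large residual that still has to be coded, which is why a single motion model struggles to cover all types of motion.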
Yet another approach utilizes low-rank non-negative matrix factorization that can exploit the near stationarity of salient object descriptors in the scene. Hence, that method can determine low-dimensional clusters from the visual descriptors extracted from videos. Unfortunately, that approach quickly becomes unstable as the number of clusters increases. In addition, that approach does not provide a full representation for keypoint trajectories.
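For readers unfamiliar with non-negative matrix factorization, the following is a minimal sketch of one common variant, the multiplicative-update rule for the squared-error objective. It is not necessarily the algorithm used in the approach referred to above. It assumes that each column of the matrix `v` is a non-negative descriptor; the function names, the rank, the iteration count, and the toy data are hypothetical.

```python
import random

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factor non-negative v (m x n) into w (m x rank) and h (rank x n)
    using multiplicative updates, which keep all entries non-negative."""
    rng = random.Random(seed)
    rows, cols = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(rows)]
    h = [[rng.random() + 0.1 for _ in range(cols)] for _ in range(rank)]
    for _ in range(iterations):
        wt = transpose(w)
        numer, denom = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * numer[i][j] / (denom[i][j] + eps)
              for j in range(cols)] for i in range(rank)]
        ht = transpose(h)
        numer, denom = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * numer[i][j] / (denom[i][j] + eps)
              for j in range(rank)] for i in range(rows)]
    return w, h

def frobenius_error(v, w, h):
    """Frobenius norm of the reconstruction error v - w*h."""
    approx = matmul(w, h)
    return sum((v[i][j] - approx[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0]))) ** 0.5

# Six toy descriptors (columns): three near one pattern, three near another.
v = [[4.0, 4.1, 3.9, 0.0, 0.0, 0.1],
     [0.0, 0.1, 0.0, 3.0, 3.1, 2.9],
     [2.0, 2.0, 2.1, 0.0, 0.1, 0.0],
     [0.0, 0.0, 0.1, 5.0, 4.9, 5.0]]

w0, h0 = nmf(v, rank=2, iterations=0)   # random starting point
w, h = nmf(v, rank=2)                   # after the updates
error_before = frobenius_error(v, w0, h0)
error_after = frobenius_error(v, w, h)
print(error_before, error_after)
```

The columns of `w` play the role of a small set of representative descriptors, and each column of `h` indicates how strongly a given descriptor is associated with each of them.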
FIG. 1 shows the work scope of MPEG-7 prior art. Features are extracted 120 from a video 110 and MPEG-7 descriptors are then generated 130. The descriptors can be provided to a search engine 140 to determine content in a database 150 with similar descriptors. MPEG-7 is a standard for describing features of multimedia content. It does not provide a way to extract multimedia descriptors and features, or a way to measure similarity between contents.
More specifically, MPEG-7 standardized motion descriptors can be classified into two categories, as shown in FIG. 2. One category includes camera motion 230, motion activity 240, and warping parameters 250 for video segments 210; the other includes motion trajectory 260 and parametric motion 270 for moving regions 220 in the video.
The motion trajectory describes the movement of one representative point of a specific region, consisting of a set of positions and a set of interpolation functions describing the path. Because the motion trajectory is for a moving region, motion blobs are typically detected and each motion blob corresponds to a trajectory descriptor. In a typical application for a sport video, each player has a single representative point and one trajectory maintained.
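A trajectory stored as a set of positions plus interpolation functions can be sketched as follows. The sketch assumes piecewise-linear interpolation between timestamped positions of the representative point; the function name `interpolate_trajectory` and the sample data are hypothetical and do not reproduce the MPEG-7 descriptor syntax.

```python
from bisect import bisect_right

def interpolate_trajectory(times, positions, t):
    """Position of the representative point at time t, obtained by
    linear interpolation between the two surrounding stored positions."""
    if not times[0] <= t <= times[-1]:
        raise ValueError("t lies outside the stored trajectory")
    # Index of the segment containing t (the last segment when t is the end time).
    i = min(bisect_right(times, t) - 1, len(times) - 2)
    t0, t1 = times[i], times[i + 1]
    alpha = (t - t0) / (t1 - t0)
    (x0, y0), (x1, y1) = positions[i], positions[i + 1]
    return (x0 + alpha * (x1 - x0), y0 + alpha * (y1 - y0))

# One representative point, e.g. one player in a sport video (illustrative only).
times = [0.0, 1.0, 3.0]
positions = [(0.0, 0.0), (10.0, 0.0), (10.0, 20.0)]

print(interpolate_trajectory(times, positions, 0.5))  # (5.0, 0.0)
print(interpolate_trajectory(times, positions, 2.0))  # (10.0, 10.0)
```

Only three positions are stored, yet the point's location can be recovered at any intermediate time, which is adequate for a single representative point per moving region.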
The motion descriptors in the MPEG-7 prior art provide high-level semantic information for video analysis tasks, and such motions are limited to rigid objects, which is inefficient for some current applications, e.g., human action recognition. The motion descriptors defined by MPEG-7 for a rigid object are at the object level, and representing object-level trajectories efficiently was not an issue at that time.