(1) Field of Invention
The present invention relates to a system for content recognition, search, and retrieval in visual data and, more particularly, to a system for content recognition, search, and retrieval in visual data which combines multi-level descriptor generation, hierarchical indexing, and active learning-based query refinement.
(2) Description of Related Art
Content-based video retrieval is the application of computer vision to the video image retrieval problem (i.e., the problem of searching for digital images in large databases). The term “content-based” refers to the fact that the search will analyze the actual contents of the images in the video, such as colors, shapes, textures, activities, events, or any other type of information that can be retrieved from the video images.
Currently, the ability to efficiently search video based on content such as activities or events is lacking. The need for rapid and accurate video search and monitoring cannot be met with current labor-intensive methods. State-of-the-art techniques for video object and activity classification require extensive training on large datasets with manually annotated ground truth and are often brittle to slight changes in illumination, viewing angle, movement, and environment. Existing content-based video search capabilities rely on metadata provided by human annotation, which is labor intensive and impractical for large video databases.
Although content-based image retrieval has been a focus of research for many years, most of the approaches are focused on using statistical information content in the images and are not directly applicable to video retrieval, since it is non-trivial to accurately model and recognize the dynamically changing video content and context. Although visual vocabulary approaches for image retrieval showed great potential for handling images, such approaches have not been extended to videos. Similarly, spatio-temporal descriptors have been developed for action recognition and classification, but these descriptors have been used only to model and classify activities and not for efficient video search and retrieval. Current systems, as will be described in further detail below, make the problem of activity recognition and search in videos complex and unwieldy. Furthermore, the current search and retrieval methods cannot scale to efficiently index and retrieve video of interest from large video repositories in a few seconds.
For example, Shechtman and Irani describe an approach for matching activity descriptions in video using local self-similarity descriptors in “Matching Local Self-Similarities Across Images and Videos” in Institute of Electrical and Electronics Engineers (IEEE) Conference on Computer Vision and Pattern Recognition, 2007. The matching algorithm described is based on optimization of a distance function and cannot be scaled to large video archives.
In “Video Retrieval Using Spatio-Temporal Descriptors” in Association for Computing Machinery, pp. 508-517, 2003, DeMenthon and Doermann present a method using a binary tree approach by clustering image region features and region velocity vectors. However, the region descriptor described by DeMenthon and Doermann is more global in nature and region-based, requiring segmentation. The approach described by the authors does not address search scalability.
Furthermore, related art is described by Sivic and Zisserman in “Video Google: A Text Retrieval Approach to Object Matching in Videos” in Proceedings of the IEEE International Conference on Computer Vision, 2003 and by Nister and Stewenius in “Scalable Recognition with a Vocabulary Tree” in IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 2161-2168, 2006. These two references present visual vocabulary and vocabulary tree approaches, respectively, for searching images in image archives using two-dimensional image descriptors. However, the references do not disclose the use of spatio-temporal descriptors, nor do they propose searching videos based on activity contents.
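For context, the visual vocabulary approaches referenced above represent an image as a histogram over a fixed set of quantized local descriptors (“visual words”), which reduces image matching to comparing histograms. The following is a minimal illustrative sketch of that general idea, not a reproduction of any cited method; the function names and the use of a flat (non-hierarchical) vocabulary are simplifying assumptions.

```python
import numpy as np

def quantize(descriptors, vocabulary):
    """Assign each local descriptor to its nearest visual word (vocabulary row)."""
    # Pairwise squared distances between every descriptor and every word
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def bow_histogram(descriptors, vocabulary):
    """Represent an image as a normalized histogram of visual-word counts."""
    words = quantize(descriptors, vocabulary)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    total = hist.sum()
    return hist / total if total else hist

def similarity(h1, h2):
    """Cosine similarity between two bag-of-words histograms."""
    denom = np.linalg.norm(h1) * np.linalg.norm(h2)
    return float(h1 @ h2) / denom if denom else 0.0
```

A vocabulary tree replaces the flat nearest-word search with a hierarchical descent through clustered word levels, which is what makes the cited image-search approach scale to large archives; the sketch above omits that hierarchy for brevity.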
Finally, additional related art is presented by Scovanner et al. in “A 3-Dimensional SIFT Descriptor and its Application to Action Recognition” in Proceedings of Multimedia, pp. 357-360, 2007, Niebles et al. in “Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words” in British Machine Vision Conference, 2006, and Dollar et al. in “Behavior Recognition Via Sparse Spatio-Temporal Features” in IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp. 65-72, 2005. All of these references describe approaches using spatio-temporal descriptors for human action recognition. Each of the approaches presented in the references requires labeled data for training classifiers. These approaches also do not address how the spatio-temporal descriptors can be used for video search. Each of the references referred to above and below is hereby incorporated by reference as though fully set forth herein.
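The spatio-temporal descriptors in the action-recognition work cited above generally summarize a small space-time volume (a “cuboid”) of video by histograms of local gradients across both space and time. The sketch below illustrates that general idea only; the orientation-binning scheme and function name are simplified assumptions and do not reproduce any particular cited descriptor.

```python
import numpy as np

def cuboid_descriptor(cuboid, n_bins=8):
    """Summarize a space-time patch (t, y, x) as a magnitude-weighted
    histogram of spatial gradient orientations, computed over all frames.
    A simplified stand-in for descriptors such as 3-D SIFT."""
    # Gradients along time, vertical, and horizontal axes
    gt, gy, gx = np.gradient(cuboid.astype(float))
    mag = np.sqrt(gt ** 2 + gy ** 2 + gx ** 2)
    # Bin the spatial orientation of each voxel's gradient into n_bins
    theta = np.arctan2(gy, gx)  # range (-pi, pi]
    bins = np.clip(((theta + np.pi) / (2 * np.pi) * n_bins).astype(int),
                   0, n_bins - 1)
    hist = np.zeros(n_bins)
    np.add.at(hist, bins.ravel(), mag.ravel())
    norm = np.linalg.norm(hist)
    return hist / norm if norm else hist
```

Descriptors of this kind are what the present system quantizes into a vocabulary so that activity content, not just appearance, becomes searchable.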
Thus, a continuing need exists for a system which allows rapid and efficient content recognition, search, and retrieval in visual data for content which is based on activities or events using only unlabeled data and unsupervised training.