Video, including surveillance video, is becoming increasingly prominent in private and public spaces, on the Internet, and on other remotely accessible media. As the amount of video stored on various computer systems increases, it becomes more difficult to search for desired videos. In some instances, a video search may be carried out by selecting a video clip and then having a computer system automatically retrieve similar videos. Different types of similarity may be compared in order to retrieve relevant videos.
In conventional video retrieval systems, color features (histogram or correlogram) and visual features (e.g., HOG, SIFT) are commonly used to find similar scenes, rather than similar activities. See, e.g., C. F. Chang, W. Chen, H. J. Meng, H. Sundaram, D. Zhong, “A Fully Automated Content Based Video Search Engine Supporting Spatio-Temporal Queries,” PAMI, 1998 (referred to herein as “Chang”); J. C. Niebles, H. Wang, L. Fei-Fei, “Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words,” IJCV 2008 (referred to herein as “Niebles”); and Y. Wang, P. Sabzmeydani, G. Mori, “Semi-latent Dirichlet allocation: A hierarchical model for human action recognition”, Workshop on Human Motion Understanding, Modeling, Capture and Animation, 2007 (referred to herein as “Wang”), each of which is incorporated by reference herein in its entirety. This is a particular problem for surveillance videos: because the videos are often captured at the same sites, the scenes look alike regardless of the activity taking place, and conventional retrieval methods typically cannot detect activities of interest. Certain video search schemes are able to retrieve video events using time intervals, and may also include video retrieval concept detectors, which handle multi-modal queries and fuse them to find the best matching videos. See, e.g., C. G. M. Snoek, M. Worring, “Multimedia Event-Based Video Indexing Using Time Intervals,” IEEE Trans. on Multimedia, Vol. 7, No. 4, August 2005 (hereinafter referred to as “Snoek1”); and C. G. M. Snoek, B. Huurnink, L. Hollink, M. D. Rijke, G. Schreiber, M. Worring, “Adding semantics to detectors for video retrieval,” IEEE Trans. on Multimedia, 2007 (referred to herein as “Snoek2”), each of which is incorporated by reference herein in its entirety. However, these systems may fail to detect semantic events in a video due to detection error or noise, and such videos will thus not be considered as search result candidates.
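The limitation of appearance-based retrieval for same-site surveillance video can be illustrated with a small sketch. The example below is illustrative only and is not taken from any of the cited systems: it assumes grayscale frames given as lists of pixel rows, and the helper names (`make_frame`, `gray_histogram`, `histogram_intersection`) are hypothetical. Two frames share the same background but show a person at different positions, and their intensity histograms turn out to be identical.

```python
def make_frame(person_col, size=8):
    """Hypothetical 8x8 grayscale frame: bright background, dark 2x2 'person'."""
    frame = [[200] * size for _ in range(size)]
    for row in range(3, 5):
        for col in range(person_col, person_col + 2):
            frame[row][col] = 30
    return frame


def gray_histogram(frame, bins=4, max_value=256):
    """Normalized intensity histogram of a frame with pixels in [0, max_value)."""
    counts = [0] * bins
    total = 0
    for row in frame:
        for pixel in row:
            counts[pixel * bins // max_value] += 1
            total += 1
    return [count / total for count in counts]


def histogram_intersection(hist_a, hist_b):
    """Similarity in [0, 1]; 1.0 means the two histograms are identical."""
    return sum(min(a, b) for a, b in zip(hist_a, hist_b))


# Same site, but the person is on the left in one frame and on the right in the other.
frame_left = make_frame(person_col=0)
frame_right = make_frame(person_col=6)
similarity = histogram_intersection(gray_histogram(frame_left),
                                    gray_histogram(frame_right))
print(similarity)
```

Because the histogram discards where the dark pixels are, the two frames are scored as a perfect match even though the person is in a different place, which is why such features tend to retrieve similar scenes rather than similar activities.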
In recent papers, Markov Logic Networks (MLN) and Stochastic Context Sensitive Grammar (SCSG) are described for use with video data representation. SCSGs construct a scene parse graph by parsing stochastic attribute grammars. See, e.g., M. Richardson, P. Domingos, “Markov logic networks,” Mach. Learn., 62:107-136, 2006 (referred to herein as “Richardson”); and S. C. Zhu, D. Mumford, “Quest for a stochastic grammar of images”, Foundations and Trends of Computer Graphics and Vision, vol. 2, no. 4, pp 259-362, 2006 (referred to herein as “Zhu”), each of which is incorporated by reference herein in its entirety. Embodying SCSG, the And-Or graph (AOG) is introduced for scene understanding and can flexibly express more complex and topological structures of the scene, objects, and activities. See, e.g., T. Wu, S. Zhu, “A Numeric Study of the Bottom-up Top-down Inference Processes in And-Or Graphs,” ICCV, 2009 (referred to herein as “Wu”), which is incorporated by reference herein in its entirety. In some examples, objects and activities, and their spatial, temporal, and ontological relationships in a scene, are modeled and represented with an AOG. When the activities are represented as a graph, finding a similar activity may involve matching similar graphs in a video database.
Graph matching generally falls into two categories: exact matching and inexact matching. Exact matching generally requires isomorphism, such that vertices and their connecting edges are mapped exactly between two graphs or subgraphs. In addition, exact matching is computationally expensive; subgraph isomorphism, for example, is NP-complete. Inexact graph matching, on the other hand, maps between subsets of vertices with relaxed edge connectivity. It typically finds suboptimal solutions, but does so in polynomial time. See, e.g., D. Conte, P. Foggia, C. Sansone, M. Vento, “Thirty Years Of Graph Matching In Pattern Recognition,” Int. Journal of Pat. Rec. and Art. Int., Vol. 18, No. 3, pp. 265-298, 2004 (referred to herein as “Conte”), which is incorporated by reference herein in its entirety. The condition for exact matching may be quite rigid and typically makes it difficult to match graphs.
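The difference between the two categories can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not an algorithm from Conte or any other cited reference: graphs are small and undirected, vertices are numbered from 0, and the function names and the label-based greedy heuristic are hypothetical choices made for this example. The exact check tries every vertex permutation, while the inexact score pairs vertices greedily by label and reports the fraction of edges preserved.

```python
from itertools import permutations


def is_isomorphic(edges_a, edges_b, n):
    """Exact matching by brute force over all n! mappings of vertices 0..n-1."""
    set_a = {frozenset(edge) for edge in edges_a}
    set_b = {frozenset(edge) for edge in edges_b}
    if len(set_a) != len(set_b):
        return False
    for perm in permutations(range(n)):
        mapped = {frozenset((perm[u], perm[v])) for u, v in edges_a}
        if mapped == set_b:
            return True
    return False


def inexact_score(labels_a, edges_a, labels_b, edges_b):
    """Inexact matching: greedy label pairing, then fraction of A's edges kept in B."""
    set_b = {frozenset(edge) for edge in edges_b}
    used = set()
    mapping = {}
    for i, label in enumerate(labels_a):
        for j, other in enumerate(labels_b):
            if j not in used and other == label:
                mapping[i] = j
                used.add(j)
                break
    if not edges_a:
        return 0.0
    kept = sum(
        1 for u, v in edges_a
        if u in mapping and v in mapping
        and frozenset((mapping[u], mapping[v])) in set_b
    )
    return kept / len(edges_a)


# A path 0-1-2 matches a relabeled path exactly, but never matches a triangle.
path = [(0, 1), (1, 2)]
relabeled_path = [(1, 0), (0, 2)]
triangle = [(0, 1), (1, 2), (0, 2)]
print(is_isomorphic(path, relabeled_path, 3))
print(is_isomorphic(path, triangle, 3))

# Query: a person related to a car and to a bag. Candidate: only person-car.
score = inexact_score(["person", "car", "bag"], [(0, 1), (0, 2)],
                      ["car", "person", "bag", "person"], [(1, 0)])
print(score)
```

The exact check rejects the candidate outright whenever a single edge differs, whereas the greedy score still reports a partial match. The greedy pairing runs in polynomial time but may miss a better assignment, which reflects the suboptimality noted above.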
One type of inexact matching uses subgraph indexing for video retrieval. Graphs may be broken down into subgraphs, and these subgraphs may be used for retrieving videos. See, e.g., K. Shearer, H. Bunke, S. Venkatesh, “Video indexing and similarity retrieval by largest common subgraph detection using decision trees,” Pattern Recognition, 2001 (referred to herein as “Shearer”), which is incorporated by reference herein in its entirety. In this system, similar videos are retrieved by simply finding the largest common subgraph. However, the number of subgraphs associated with a graph, even that of a fairly simple video scene, may run into the thousands or even millions. Thus, a comparison for a largest common subgraph may require large processing and storage capabilities.
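The growth in the number of subgraphs can be made concrete with a short sketch. It rests on an assumption made only for this example: a "subgraph" is taken to mean a connected, vertex-induced subgraph, which may differ from the definition used in Shearer. The function name is hypothetical.

```python
from itertools import combinations


def count_connected_induced_subgraphs(n, edges):
    """Count vertex subsets of 0..n-1 whose induced subgraph is connected."""
    adjacency = {vertex: set() for vertex in range(n)}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    count = 0
    for size in range(1, n + 1):
        for subset in combinations(range(n), size):
            chosen = set(subset)
            # Depth-first search restricted to the chosen vertices.
            seen = {subset[0]}
            stack = [subset[0]]
            while stack:
                current = stack.pop()
                for neighbor in adjacency[current] & chosen:
                    if neighbor not in seen:
                        seen.add(neighbor)
                        stack.append(neighbor)
            if seen == chosen:
                count += 1
    return count


def complete_graph(n):
    """All pairwise edges among n vertices (every object related to every other)."""
    return list(combinations(range(n), 2))


for n in (4, 8, 12):
    print(n, count_connected_induced_subgraphs(n, complete_graph(n)))
```

For a fully connected graph the count is 2^n - 1, so a scene graph with only 12 mutually related objects already has 4095 such subgraphs, and one with 20 objects has over a million. Sparser graphs have fewer, but the count still grows quickly with the number of vertices and edges.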