Automatic event detection and scene understanding is an important enabling technology for video surveillance, security, and forensic analysis applications. The task involves identifying objects in the scene, describing their inter-relations, and detecting events of interest. In recent years, the proliferation of digital cameras and networked video storage systems has generated enormous amounts of video data, making efficient video processing a necessity. Video analysis is used in many areas, including surveillance and security, forensic analysis, and intelligence gathering. Currently, much of this video is monitored by human operators. While people are good at understanding video data, they are not effective at reviewing large amounts of video because of short attention spans, vulnerability to interruptions or distractions, and difficulty in processing multiple video streams.
Recent advances in computer vision technology and computing power have produced specific capabilities such as object detection and tracking, and even textual annotation of video and searchability. A number of publications, listed below and incorporated by reference herein in their entirety, explain various aspects of these capabilities:
C. Pollard, I. A. Sag, “Head-Driven Phrase Structure Grammar,” University of Chicago Press, Chicago, Ill., 1994.
R. Nevatia, J. Hobbs, B. Bolles, “An Ontology for Video Event Representation,” IEEE Workshop on Event Detection and Recognition, June 2004.
S. C. Zhu, D. B. Mumford, “Quest for a stochastic grammar of images,” Foundations and Trends in Computer Graphics and Vision, 2006.
Mun Wai Lee, Asaad Hakeem, Niels Haering, and Song-Chun Zhu, “SAVE: A Framework for Semantic Annotation of Visual Events,” Proc. 1st Int'l Workshop on Internet Vision, Anchorage, Ak., June 2008.
A. Hakeem, M. Lee, O. Javed, N. Haering, “Semantic Video Search using Natural Language Queries,” ACM Multimedia, 2009.
Benjamin Yao, Xiong Yang, Liang Lin, Mun Wai Lee, and Song-Chun Zhu, “I2T: Image Parsing to Text Description,” Proceedings of the IEEE, vol. 98, no. 8, pp. 1485-1508, August 2010.
Tom Simonite, “Surveillance Software Knows What a Camera Sees,” Technology Review, MIT, Jun. 1, 2010.
Zeeshan Rasheed, Geoff Taylor, Li Yu, Mun Wai Lee, Tae Eun Choe, Feng Guo, Asaad Hakeem, Krishnan Ramnath, Martin Smith, Atul Kanaujia, Dana Eubanks, Niels Haering, “Rapidly Deployable Video Analysis Sensor Units for Wide Area Surveillance,” First IEEE Workshop on Camera Networks (WCN2010), held in conjunction with CVPR 2010, Jun. 14, 2010.
Tae Eun Choe, Mun Wai Lee, Niels Haering, “Traffic Analysis with Low Frame Rate Camera Network,” First IEEE Workshop on Camera Networks (WCN2010), held in conjunction with CVPR 2010, Jun. 14, 2010.
However, scene understanding and searchability can benefit from a more thorough understanding of objects, scene elements, and their inter-relations, and from more comprehensive and seamless textual annotation.