Advances in multimedia technology, including commercial prospects for video-on-demand and digital library systems, have generated recent interest in content-based video analysis. Video data offers users of multimedia systems a wealth of information; however, it is not as readily manipulated as other data such as text. Raw video data has no immediate "handles" by which the multimedia system user may analyze its contents. Annotating video data with symbolic information describing its semantic content facilitates analysis beyond simple serial playback.
Video data poses unique problems for multimedia information systems that text does not. Textual data is a symbolic abstraction of the spoken word that is usually generated and structured by humans. Video, on the other hand, is a direct recording of visual information. In its raw and most common form, video data is subject to little human-imposed structure, and thus has no immediate "handles" by which the multimedia system user may analyze its contents.
For example, consider an on-line movie screenplay (textual data) and a digitized movie (video and audio data). If one were analyzing the screenplay and wished to find instances of the word "horse" in the text, any of many text searching algorithms could be employed to locate every instance of this symbol. Such analysis is common in on-line text databases. If, however, one wished to find every scene in the digitized movie in which a horse appeared, the task would be much more difficult. Unless a human performs some sort of pre-processing of the video data, there are no symbolic keys on which to search. For a computer to assist in the search, it must analyze the semantic content of the video data itself. Without such capabilities, the information available to the multimedia system user is greatly reduced.
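The screenplay half of this comparison can be sketched in a few lines. The following is only an illustration: the function name `find_word` and the sample screenplay text are hypothetical, and whole-word, case-insensitive matching is an assumed policy rather than anything specified above.

```python
import re

def find_word(text, word):
    """Return the character offset of every whole-word occurrence of `word`.

    Matching is case-insensitive (an assumption made for this sketch).
    """
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [match.start() for match in pattern.finditer(text)]

# Hypothetical screenplay fragment: the word itself is the search key.
screenplay = "EXT. RANCH - DAY\nA horse gallops past. The Horse stops near the horseshoe."
hits = find_word(screenplay, "horse")

# A digitized frame, by contrast, is only a grid of intensity values.
# Nothing in it is labeled "horse", so there is no key to match against.
frame = [[12, 40, 200], [15, 42, 198], [11, 39, 201]]
```

Here `hits` holds one offset per whole-word match ("horseshoe" is skipped), while `frame` carries no symbol that a comparable search could use.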
Thus, much research in video analysis focuses on semantic content-based search and retrieval techniques. The term "video indexing" as used herein refers to the process of marking important frames or objects in the video data for efficient playback. An indexed video sequence allows a user not only to play the sequence in the usual serial fashion, but also to "jump" to points of interest while it plays. A common indexing scheme is to employ scene cut detection to determine breakpoints in the video data. See H. Zhang, A. Kankanhalli, and S. W. Smoliar, Automatic Partitioning of Full Motion Video, Multimedia Systems, 1, 10-28 (1993). Indexing has also been performed based on camera (i.e., viewpoint) motion, see A. Akutsu, Y. Tonomura, H. Hashimoto, and Y. Ohba, Video Indexing Using Motion Vectors, in Petros Maragos, editor, Visual Communications and Image Processing, Proc. SPIE 1818, 1522-1530 (1992), and on object motion, see M. Ioka and M. Kurokawa, A Method for Retrieving Sequences of Images on the Basis of Motion Analysis, in Image Storage and Retrieval Systems, Proc. SPIE 1662, 35-46 (1992), and S. Y. Lee and H. M. Kao, Video Indexing-an approach based on moving object and track, in Wayne Niblack, editor, Storage and Retrieval for Image and Video Databases, Proc. SPIE 1908, 25-36 (1993).
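As a rough illustration of indexing by scene cut detection, the sketch below marks a breakpoint wherever the intensity histograms of consecutive frames differ sharply, then uses the breakpoints to "jump" forward. Histogram differencing is one common way to detect cuts and is assumed here for illustration; it is not presented as the method of any work cited above. The function names, the 16-bin histogram, the 0.5 threshold, and the synthetic frames are all hypothetical.

```python
from bisect import bisect_right

def histogram(frame, bins=16):
    """Count the pixels of a grayscale frame (flat list of 0-255 ints) per bin."""
    counts = [0] * bins
    for pixel in frame:
        counts[pixel * bins // 256] += 1
    return counts

def detect_scene_cuts(frames, threshold=0.5, bins=16):
    """Return the indices of frames that begin a new scene.

    Assumes all frames are non-empty and have the same number of pixels.
    """
    cuts = []
    prev = None
    for i, frame in enumerate(frames):
        hist = histogram(frame, bins)
        if prev is not None:
            # Normalized histogram difference: 0 = identical, 1 = disjoint.
            diff = sum(abs(a - b) for a, b in zip(hist, prev)) / (2 * len(frame))
            if diff > threshold:
                cuts.append(i)
        prev = hist
    return cuts

def jump_to_next(cuts, current_frame):
    """Return the first breakpoint after `current_frame`, or None if there is none."""
    pos = bisect_right(cuts, current_frame)
    return cuts[pos] if pos < len(cuts) else None

# Synthetic sequence: three dark frames, three bright frames, two mid-gray frames.
frames = [[10] * 64] * 3 + [[240] * 64] * 3 + [[128] * 64] * 2
cuts = detect_scene_cuts(frames)   # breakpoints at frames 3 and 6
next_scene = jump_to_next(cuts, 0) # frame 3
```

The breakpoint list is kept sorted, so a viewer can locate the next point of interest with a binary search instead of scanning the sequence serially.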
Using breakpoints found via scene cut detection, other researchers have pursued hierarchical segmentation to analyze the logical organization of video sequences. For more on this, see the following: G. Davenport, T. Smith, and N. Pincever, Cinematic Primitives for Multimedia, IEEE Computer Graphics & Applications, 67-74 (1991); M. Shibata, A Temporal Segmentation Method for Video Sequences, in Petros Maragos, editor, Visual Communications and Image Processing, Proc. SPIE 1818, 1194-1205 (1992); D. Swanberg, C-F. Shu, and R. Jain, Knowledge Guided Parsing in Video Databases, in Wayne Niblack, editor, Storage and Retrieval for Image and Video Databases, Proc. SPIE 1908, 13-24 (1993). In the same way that text is organized into sentences, paragraphs and chapters, the goal of these techniques is to determine a hierarchical grouping of video sub-sequences. Combining this structural information with content abstractions of segmented sub-sequences provides multimedia system users a top-down view of video data. For more details see F. Arman, R. Depommier, A. Hsu, and M. Y. Chiu, Content-Based Browsing of Video Sequences, in Proceedings of ACM International Conference on Multimedia (1994).
Closed-circuit television (CCTV) systems provide security personnel a wealth of information regarding activity in both indoor and outdoor domains. However, few tools exist that provide automated or assisted analysis of video data; therefore, the information from most security cameras is under-utilized.
Security systems typically process video camera output by displaying the video on monitors for simultaneous viewing by security personnel, by recording the data to time-lapse VCR machines for later playback, or both. Serious limitations exist in these approaches:
Psycho-visual studies have shown that humans are limited in the amount of visual information they can process in tasks like video camera monitoring. After a time, visual activity in the monitors can easily go unnoticed. Monitoring effectiveness is additionally taxed when output from multiple video cameras must be viewed.
Time-lapse VCRs are limited in the amount of data that they can store in terms of resolution, frames per second, and length of recordings. Continuous use of such devices requires frequent equipment maintenance and repair.
In both cases, the video information is unstructured and un-indexed. Without an efficient means to locate visual events of interest in the video stream, it is not cost-effective for security personnel to monitor or record the output from all available video cameras.
Video motion detectors are the most powerful of available tools to assist in video monitoring. Such systems detect visual movement in a video stream and can activate alarms or recording equipment when activity exceeds a pre-set threshold. However, existing video motion detectors typically sense only simple intensity changes in the video data and cannot provide more intelligent feedback regarding the occurrence of complex object actions such as inventory theft.
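The behavior described above, and its limitation, can be illustrated with a minimal frame-differencing sketch. This is an assumption about how a simple intensity-based detector might work, not a description of any particular product; the function names, the per-pixel delta of 25, the activity threshold of 0.1, and the synthetic frames are hypothetical.

```python
def changed_fraction(prev_frame, frame, pixel_delta=25):
    """Fraction of pixels whose intensity changed by more than `pixel_delta`.

    Frames are flat lists of 0-255 grayscale values of equal, non-zero length.
    """
    changed = sum(1 for a, b in zip(prev_frame, frame) if abs(a - b) > pixel_delta)
    return changed / len(frame)

def motion_alarms(frames, activity_threshold=0.1, pixel_delta=25):
    """Return the indices of frames at which activity exceeds the pre-set threshold."""
    alarms = []
    for i in range(1, len(frames)):
        if changed_fraction(frames[i - 1], frames[i], pixel_delta) > activity_threshold:
            alarms.append(i)
    return alarms

# Synthetic scene: an object covering 20% of the view appears, stays, then leaves.
background = [50] * 100
with_object = [50] * 80 + [200] * 20
frames = [background, background, with_object, with_object, background]
alarms = motion_alarms(frames)  # fires at frames 2 and 4

# A uniform lighting change also trips the detector, although nothing moved.
lighting_alarms = motion_alarms([background, [90] * 100])
```

The detector reports only that intensities changed. The alarm at frame 2 (object appears) is indistinguishable from the alarm at frame 4 (object leaves), and neither says which object was involved or what was done to it.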