This invention relates to motion event detection as used, for example, in surveillance.
Advances in multimedia technology, including commercial prospects for video-on-demand and digital library systems, have generated recent interest in content-based video analysis. Video data offers users of multimedia systems a wealth of information; however, it is not as readily manipulated as other data such as text. Raw video data has no immediate “handles” by which the multimedia system user may analyze its contents. Annotating video data with symbolic information describing its semantic content facilitates analysis beyond simple serial playback.
Video data poses unique problems for multimedia information systems that text does not. Textual data is a symbolic abstraction of the spoken word that is usually generated and structured by humans. Video, on the other hand, is a direct recording of visual information. In its raw and most common form, video data is subject to little human-imposed structure, and thus has no immediate “handles” by which the multimedia system user may analyze its contents.
For example, consider an on-line movie screenplay (textual data) and a digitized movie (video and audio data). If one were analyzing the screenplay and interested in searching for instances of the word “horse” in the text, many text searching algorithms could be employed to locate every instance of this symbol as desired. Such analysis is common in on-line text databases. If, however, one were interested in searching for every scene in the digitized movie where a horse appeared, the task would be much more difficult. Unless a human performs some sort of pre-processing of the video data, there are no symbolic keys on which to search. For a computer to assist in the search, it must analyze the semantic content of the video data itself. Without such capabilities, the information available to the multimedia system user is greatly reduced.
Thus, much research in video analysis focuses on semantic content-based search and retrieval techniques. The term “video indexing” as used herein refers to the process of marking important frames or objects in the video data for efficient playback. An indexed video sequence allows a user not only to play the sequence in the usual serial fashion, but also to “jump” to points of interest while it plays. A common indexing scheme is to employ scene cut detection to determine breakpoints in the video data. See H. Zhang, A. Kankanhalli, and Stephen W. Smoliar, Automatic Partitioning of Full Motion Video, Multimedia Systems, 1, 10-28 (1993). Indexing has also been performed based on camera (i.e., viewpoint) motion, see A. Akutsu, Y. Tonomura, H. Hashimoto, and Y. Ohba, Video Indexing Using Motion Vectors, in Petros Maragos, editor, Visual Communications and Image Processing, Proc. SPIE 1818, 1552-1530 (1992), and object motion, see M. Ioka and M. Kurokawa, A Method for Retrieving Sequences of Images on the Basis of Motion Analysis, in Image Storage and Retrieval Systems, Proc. SPIE 1662, 35-46 (1992), and S. Y. Lee and H. M. Kao, Video Indexing-an approach based on moving object and track, in Wayne Niblack, editor, Storage and Retrieval for Image and Video Databases, Proc. SPIE 1908, 25-36 (1993).
Using breakpoints found via scene cut detection, other researchers have pursued hierarchical segmentation to analyze the logical organization of video sequences. For more on this, see the following: G. Davenport, T. Smith, and N. Pincever, Cinematic Primitives for Multimedia, IEEE Computer Graphics and Applications, 67-74 (1991); M. Shibata, A Temporal Segmentation Method for Video Sequences, in Petros Maragos, editor, Visual Communications and Image Processing, Proc. SPIE 1818, 1194-1205 (1992); D. Swanberg, C-F. Shu, and R. Jain, Knowledge Guided Parsing in Video Databases, in Wayne Niblack, editor, Storage and Retrieval for Image and Video Databases, Proc. SPIE 1908, 13-24 (1993). In the same way that text is organized into sentences, paragraphs and chapters, the goal of these techniques is to determine a hierarchical grouping of video sub-sequences. Combining this structural information with content abstractions of segmented sub-sequences provides multimedia system users a top-down view of video data. For more details see F. Arman, R. Depommier, A. Hsu, and M. Y. Chiu, Content-Based Browsing of Video Sequences, in Proceedings of ACM International Conference on Multimedia, (1994).
Closed-circuit television (CCTV) systems provide security personnel a wealth of information regarding activity in both indoor and outdoor domains. However, few tools exist that provide automated or assisted analysis of video data; therefore, the information from most security cameras is under-utilized. Security systems typically process video camera output by displaying the video on monitors for simultaneous viewing by security personnel, by recording the data to time-lapse VCR machines for later playback, or both. Serious limitations exist in these approaches:
Psycho-visual studies have shown that humans are limited in the amount of visual information they can process in tasks like video camera monitoring. After a time, visual activity in the monitors can easily go unnoticed. Monitoring effectiveness is additionally taxed when output from multiple video cameras must be viewed.
Time-lapse VCRs are limited in the amount of data that they can store in terms of resolution, frames per second, and length of recordings. Continuous use of such devices requires frequent equipment maintenance and repair.
In both cases, the video information is unstructured and un-indexed. Without an efficient means to locate visual events of interest in the video stream, it is not cost-effective for security personnel to monitor or record the output from all available video cameras.
Video motion detectors are the most powerful of available tools to assist in video monitoring. Such systems detect visual movement in a video stream and can activate alarms or recording equipment when activity exceeds a pre-set threshold. However, existing video motion detectors typically sense only simple intensity changes in the video data and cannot provide more intelligent feedback regarding the occurrence of complex object actions such as inventory theft.
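By way of illustration only, the kind of simple intensity-change detection described above might be sketched as follows. This is a minimal sketch under stated assumptions, not a description of any particular commercial detector: frames are assumed to be grayscale images held as nested lists, and the names changed_fraction, motion_alarm, pixel_delta, and activity_threshold are hypothetical.

```python
from typing import List

Frame = List[List[int]]  # grayscale intensities, assumed to lie in 0-255


def changed_fraction(prev: Frame, curr: Frame, pixel_delta: int = 25) -> float:
    """Return the fraction of pixels whose intensity changed by more than pixel_delta."""
    total = 0
    changed = 0
    for row_prev, row_curr in zip(prev, curr):
        for a, b in zip(row_prev, row_curr):
            total += 1
            if abs(a - b) > pixel_delta:
                changed += 1
    return changed / total if total else 0.0


def motion_alarm(prev: Frame, curr: Frame,
                 pixel_delta: int = 25,
                 activity_threshold: float = 0.1) -> bool:
    """Signal an alarm when the changed fraction exceeds the pre-set activity threshold."""
    return changed_fraction(prev, curr, pixel_delta) > activity_threshold


if __name__ == "__main__":
    empty = [[0] * 4 for _ in range(4)]
    # A bright 2x2 patch appears in the top-left corner of a 4x4 frame.
    patch = [[200, 200, 0, 0], [200, 200, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
    print(changed_fraction(empty, patch))  # 4 of 16 pixels changed
    print(motion_alarm(empty, patch))
```

Such a detector reports only that some amount of intensity change occurred. It carries no notion of which object moved or what the object did, which is the limitation noted above.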
In accordance with one embodiment of the present invention, a method is provided to perform video indexing from object motion. Moving objects are detected in a video sequence using a motion segmentor. Segmented video objects are recorded and tracked through successive frames. The paths of the objects and their intersections with the paths of other objects are determined to detect the occurrence of events. An index mark is placed to identify events of interest, such as appearance/disappearance, deposit/removal, entrance/exit, and motion/rest of objects.
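As a rough illustration of placing index marks from tracked objects, the following sketch derives appearance, disappearance, motion, and rest marks from per-object centroid tracks. It is a simplified example under stated assumptions and not the embodiment itself: motion segmentation and tracking are assumed to have already produced the tracks, disappearance is marked at the last frame in which an object was observed, path intersections (needed for deposit/removal and entrance/exit) are not modeled, and the names index_events, Observation, IndexMark, and rest_eps are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Observation = Tuple[int, float, float]  # (frame number, centroid x, centroid y)
IndexMark = Tuple[int, str, str]        # (frame number, event name, object id)


def index_events(tracks: Dict[str, List[Observation]],
                 rest_eps: float = 1.0) -> List[IndexMark]:
    """Derive index marks from object tracks (assumed output of a tracker)."""
    marks: List[IndexMark] = []
    for obj_id, observations in tracks.items():
        if not observations:
            continue
        obs = sorted(observations)
        marks.append((obs[0][0], "appearance", obj_id))
        moving: Optional[bool] = None  # unknown until two observations are compared
        for (_, x0, y0), (f1, x1, y1) in zip(obs, obs[1:]):
            dist = ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5
            now_moving = dist > rest_eps
            if now_moving != moving:
                # The object changed state: mark the frame where the change is seen.
                marks.append((f1, "motion" if now_moving else "rest", obj_id))
                moving = now_moving
        marks.append((obs[-1][0], "disappearance", obj_id))
    return sorted(marks)


if __name__ == "__main__":
    tracks = {"A": [(0, 0.0, 0.0), (1, 5.0, 0.0), (2, 10.0, 0.0),
                    (3, 10.0, 0.0), (4, 10.0, 0.0)]}
    for mark in index_events(tracks):
        print(mark)
```

The resulting list of marks can serve as the "jump" points of an indexed sequence; a fuller treatment would also compare tracks against one another to detect events that involve more than one object.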
These and other features of the invention will be apparent to those skilled in the art from the following detailed description of the invention, taken together with the accompanying drawings.