1. Technical Field
The present invention is generally directed to the fields of automatic video analysis and video compression. More specifically, the present invention is directed to a mechanism for performing automatic video analysis and video compression on video data provided by video input devices in order to generate representations of the video data using a low bandwidth data stream.
2. Description of Related Art
Video compression and automatic video analysis for tracking of moving objects are both very active areas of research. However, they have largely been disconnected areas of research. Video compression deals with minimizing the size of the video data in the video stream, while video analysis is concerned with determining the content of the video data.
In the context of video monitoring systems, such as video surveillance or security systems, the index data alerts the monitoring user to the presence of an interesting activity in the scene. However, in order to take an action, the user needs to view the corresponding video to gain a complete understanding of the activity. This capability is essential because most automatic video analysis systems make errors in event detection and will often indicate activity that is of little or no interest to the human being monitoring the video.
Current automatic video analysis systems analyze the video and generate an index. A typical video index may consist of a temporal reference into the video stream and a descriptor, where the descriptor may be a semantic token (e.g., the presence of a human face and its cardinality) or a feature descriptor of the video (e.g., a color histogram of the dominant objects). The implicit assumption of video indexing systems is that the actual video data will be available to the monitoring user when the user chooses to use the index to review the actual video. More information about such video analysis systems is available in the Handbook of Video Databases: Design and Applications by Furht and Marques, CRC Press, 2003.
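The index structure described above can be illustrated with a minimal sketch. The field names below (start frame, semantic token, cardinality, color histogram) are hypothetical illustrations of the two descriptor types mentioned, not the format of any particular system:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class IndexEntry:
    start_frame: int                 # temporal reference into the video stream
    end_frame: int
    semantic_token: str = ""         # semantic descriptor, e.g. "face"
    cardinality: int = 0             # e.g. number of faces detected
    # low-level feature descriptor, e.g. a color histogram of dominant objects
    color_histogram: List[float] = field(default_factory=list)

# An index entry alone tells the user *that* something happened;
# reviewing the event still requires access to the original footage.
index = [IndexEntry(start_frame=120, end_frame=480,
                    semantic_token="face", cardinality=2)]
print(index[0].semantic_token, index[0].cardinality)  # → face 2
```

Such an entry is only a pointer and summary; as noted above, the implicit assumption is that the underlying video remains available for review.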
Many different types of video analysis systems have been devised for use in determining the content of video data. For example, U.S. Patent Application Publication No. 20030123850 to Jun et al. discloses a system that analyzes news video and automatically detects anchorperson segments and other types of segments to generate temporal indices into the news video. The Jun system uses the index information to provide content-based access to the indexed segments and also allows for different reproduction speeds for different types of segments. This system requires both the index and the original video to allow a user to browse the news video.
U.S. Pat. No. 6,366,269, issued to Boreczky et al., describes a media file browser where the file is accessed based on a user-selected feature. For example, a user may choose to jump to a point in the media file where there is an audio transition from music to speech or a visual transition from one scene to another. This system also requires both the index and the original video to allow a user to browse the video content based on the index.
U.S. Pat. No. 6,560,281, issued to Black et al., is directed to a system which can analyze video data from a presentation, cluster frames into segments corresponding to each overhead slide used in the presentation, recognize gestures by the speaker in the video and use this information to generate a condensed version of the presentation. In this system, the condensed version of the video data can be used independently, i.e. without using the original video. However, the condensed version of the video data is not a complete representation of the original video.
U.S. Pat. No. 6,271,892, issued to Gibbon et al., describes a system that extracts key frames from video data and associates them with corresponding closed captioning text. This information may be rendered in a variety of ways, e.g., a page with printed key frames and associated closed captioning, to give a summary of the video. This system is in principle similar to the Black system discussed above and suffers from the same drawback that the summary of the video is not a complete representation of the video data.
Current video surveillance and tracking systems analyze video to detect and track objects. They use the object tracking information to infer the occurrence of certain events in the video to thereby generate event markers. These systems then use these event markers as indices for viewing the original video.
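The general flow described above, inferring events from object tracking output, can be sketched as follows. This is a minimal illustration under assumed inputs (per-frame sets of tracked object identifiers), not the method of any patented system discussed here:

```python
def event_markers(frames):
    """Infer appearance/disappearance events from tracking output.

    frames: list of sets, each holding the object IDs tracked in one frame.
    Returns a list of (frame_index, object_id, event_type) markers that a
    surveillance system could use as indices into the original video.
    """
    events = []
    previous = set()
    for t, current in enumerate(frames):
        for obj in current - previous:        # object newly tracked
            events.append((t, obj, "appear"))
        for obj in previous - current:        # object no longer tracked
            events.append((t, obj, "disappear"))
        previous = current
    return events

# Object 1 is present in frames 0-2 and leaves at frame 3;
# object 2 enters at frame 2.
print(event_markers([{1}, {1}, {1, 2}, {2}]))
# → [(0, 1, 'appear'), (2, 2, 'appear'), (3, 1, 'disappear')]
```

Note that the markers produced this way are only indices: as the Courtney systems below illustrate, the user still needs the original video to independently assess each event.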
For example, U.S. Pat. No. 5,969,755, issued to Courtney, describes a video surveillance system which incorporates object detection and tracking. The Courtney system generates a symbolic representation of the video based on the object tracking information. The Courtney system also uses the object tracking information to infer events in the video such as appearance/disappearance, deposit/removal, entrance/exit, etc. The Courtney system uses these event markers to retrieve relevant bits of the video for the user. The key drawback of the Courtney system, and systems like it, is that it requires both the index information, i.e. the event marker information, and the original video in order for the user to be able to make an independent assessment of the event.
U.S. Pat. No. 6,385,772, which is also issued to Courtney, describes a video surveillance system that uses a wireless link to transmit video to a portable unit. The video surveillance system uses motion detection as a trigger to transmit a video frame to the portable unit so that the user can make an assessment of the event. This system, while linking a detected event to a viewable representation, does not provide a complete representation of the video corresponding to the event. Thus, the Courtney system limits the ability of the user to make assessments of the situation without accessing the original video footage.
U.S. Patent Application Publication No. 20030044045 to Schoepflin discloses a system for tracking a user-selected object in a video sequence. In the Schoepflin reference, an initial selection is used as a basis for updating both the foreground and background appearance models. This system, while discussing object tracking, addresses neither the event detection problem nor the problem of generating a complete representation of the video data.
U.S. Patent Application Publication No. 20010035907 to Boemmelsiek describes a video surveillance system which uses object detection and tracking to reduce the information in a video signal. The detected objects are used as a basis for generating events which are used to index the original video data. This system again has the drawback of requiring the original video data for the user to make an independent assessment of the detected event.
Current video compression systems are completely focused on reducing the number of bits required to store the video data. However, these video compression systems do not concern themselves with indexing the video in any form. For example, U.S. Patent Application Publication No. 20030081564 to Chan discloses a wireless video surveillance system where the data from a video camera is transmitted over a wireless link to a computer display. Such a system provides access to video data from the camera without any regard to event detection. Thus, this system requires that the user view the video in order to detect events himself.
U.S. Pat. No. 5,933,535, issued to Lee et al., teaches a method of using objects or object features as the basis for compression, as opposed to rectangular blocks. This results in higher compression efficiency and lower errors. This method, while using the object properties to reduce the bandwidth required to transmit the video data, does not look at the event behavior of the objects.
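The general idea behind object- or region-based compression, as opposed to coding fixed rectangular blocks, can be sketched in a simplified form. This is a hedged illustration of the concept only, not the Lee et al. method: transmit just the pixels belonging to changed (moving) regions together with their locations, rather than every block of every frame:

```python
def encode_changed_pixels(prev_frame, cur_frame, threshold=10):
    """Toy region-based update: frames are 2-D lists of grayscale values.

    Returns a sparse list of (row, col, value) updates covering only the
    pixels that changed by more than `threshold`, which is where the
    bandwidth savings of object/region-based coding come from.
    """
    updates = []
    for y, (prev_row, cur_row) in enumerate(zip(prev_frame, cur_frame)):
        for x, (p, c) in enumerate(zip(prev_row, cur_row)):
            if abs(c - p) > threshold:
                updates.append((y, x, c))
    return updates

prev = [[0, 0], [0, 0]]
cur = [[0, 200], [0, 0]]
print(encode_changed_pixels(prev, cur))  # → [(0, 1, 200)]
```

As the text notes, such schemes reduce the bits needed to represent the video, but the sparse updates carry no information about the event behavior of the objects.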
U.S. Pat. No. 6,614,847, issued to Das et al., discloses an object oriented video compression system which decomposes the video data in regions corresponding to objects and uses these regions as the basis for compression. However, this system, like most other compression systems, does not incorporate any video event information.