The present invention relates generally to processing digital video information and, more specifically, to parsing and indexing compressed video streams.
Digitized video provides significant improvements over analog video media with regard to video image manipulation. Digital video data can be readily compressed and decompressed, thereby enabling efficient transmission between remote sites. Efficient compression and decompression also enhance performance in the storage and retrieval of video data. As computer networks improve and video libraries become more accessible to homes and offices via the Internet, the importance of sufficient bandwidth to support video transmission and efficient methods for video indexing, retrieval, and browsing becomes more acute. However, the effective utilization of video resources is hampered by sometimes inadequate organization of video information and a need for further improvements in the performance of retrieval systems for video information.
The time-dependent nature of video makes it a uniquely challenging medium to process. Several compression standards have been developed and implemented within the last two decades for video compression, including MPEG-1 and MPEG-2. Numerous techniques for video indexing and retrieval have been developed within the parameters defined by MPEG-1 and MPEG-2. U.S. Pat. No. 5,719,643 to Nakajima describes a method for detecting scene cuts in which an input image and a reference image are entered into an image processing unit and both are converted to contracted images. The contracted input image is compared to the contracted reference image to determine an interframe difference in luminance signals of the input and reference frames and temporal changes between the input and reference frames. Based on the comparison, a determination is made as to whether the input frame is a cut frame, a non-cut frame, or a cut-frame candidate.
It is also known in the art to select key frames from video sequences in order to use the selected frames as representative frames to convey the content of the video sequences from which they are chosen. The key frames are extracted from the video sequences in a manner which is similar to the determination of scene cuts, otherwise known as shot boundaries. A reference frame is compared to an input frame to determine whether the two frames are sufficiently different that a preselected difference threshold has been exceeded. Key frames can be used to enable users of retrieval systems to efficiently browse an entire video sequence by viewing only key frames. Key frames can also be utilized in video retrieval so that only key frames of a video sequence will be searched instead of searching all frames within a video sequence.
The current methods for detecting shot boundaries, extracting key frames, and video retrieval all rely on dissimilarities or similarities between video frames. However, reliance on global descriptions of video frames does not always provide the desired precision in video indexing and retrieval. For example, users of a video retrieval system might have particular subject matter within a video frame which they desire to retrieve without knowledge of any background information which might accompany the subject matter in any particular shot. Utilizing the current video retrieval methods, which rely on global descriptions of video frames, users might well be unable to locate a video shot containing the desired subject matter.
What is needed is a method and system which enables efficient indexing, retrieval, and browsing of compressed video at a higher level of detail than is available in the current art.
A method for parsing, indexing and retrieving compressed video data includes indexing video frames within a compressed video stream based on a comparison of video objects within frames of the compressed video stream. A first configuration of video objects in a first video frame and a second configuration of video objects in a second video frame are identified, wherein each first frame video object and each second frame video object has an associated quantitative attribute. A comparison of a first quantitative attribute associated with a first frame video object to a second quantitative attribute associated with a second frame video object is performed to ascertain whether a difference between the first and second quantitative attributes exceeds a predetermined threshold. If the predetermined threshold is exceeded, a video frame is selected from a sequence of video frames bounded by the first and second video frames, and the selected frame is used for indexing purposes.
In a preferred embodiment, the method is performed to identify shot boundaries and key instances of video objects, extract key video frames, detect camera operations, detect special effects video editing, and enable video retrieval and browsing.
The quantitative attributes of video objects relate to at least one of size, shape, motion, or texture. Shot boundaries are detected within a video sequence by selecting the first video frame, which might be an initial video frame in a sequence, and the second video frame such that the first video frame is separated by some predetermined number of video frames from the second video frame. First quantitative attributes associated with the first frame video objects are calculated and compared to second quantitative attributes associated with second frame video objects to determine a quantitative attribute differential between the first and second frames. Alternatively, a quantity of first frame video objects is calculated and compared to a quantity of second frame video objects to determine a video object quantity differential between the first and second video frames. In a first embodiment, the quantitative attribute differential is compared to a shot boundary threshold. If the quantitative attribute differential exceeds the shot boundary threshold, a shot boundary is indexed in the video sequence bounded by the first and second video frames. In a second embodiment, the video object quantity differential is compared to a shot boundary threshold and, if the threshold is exceeded, a shot boundary is indexed. This process is repeated by selecting subsequent first video frames and subsequent second video frames for shot boundary analysis to identify additional shot boundaries in the video sequence.
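By way of illustration only, the shot boundary comparison described above might be sketched as follows. The VideoObject record, the sum-of-absolute-differences metric, the choice to index the second frame of each pair, and all function names are assumptions made for this sketch; the description does not prescribe a particular differential formula or index position.

```python
from dataclasses import dataclass
from typing import List, Sequence

ATTRIBUTES = ("size", "shape", "motion", "texture")


@dataclass(frozen=True)
class VideoObject:
    """Hypothetical record holding one scalar per quantitative attribute."""
    size: float
    shape: float
    motion: float
    texture: float


Frame = Sequence[VideoObject]


def attribute_differential(first: Frame, second: Frame) -> float:
    """Assumed metric: sum of absolute differences of per-frame attribute totals."""
    def totals(frame: Frame) -> List[float]:
        return [sum(getattr(obj, name) for obj in frame) for name in ATTRIBUTES]
    return sum(abs(a - b) for a, b in zip(totals(first), totals(second)))


def quantity_differential(first: Frame, second: Frame) -> int:
    """Difference in the number of video objects between two frames."""
    return abs(len(first) - len(second))


def detect_shot_boundaries(frames: Sequence[Frame], gap: int, threshold: float,
                           use_quantity: bool = False) -> List[int]:
    """Compare frames separated by `gap` and index a boundary when the
    chosen differential exceeds `threshold`."""
    boundaries = []
    for i in range(0, len(frames) - gap, gap):
        first, second = frames[i], frames[i + gap]
        if use_quantity:
            diff = quantity_differential(first, second)
        else:
            diff = attribute_differential(first, second)
        if diff > threshold:
            # Assumption: the second frame of the pair marks the boundary.
            boundaries.append(i + gap)
    return boundaries
```

Setting use_quantity to True corresponds to the object quantity comparison; leaving it False corresponds to the quantitative attribute comparison.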
Within the video shots defined by the shot boundaries, key instances of objects, key frames, camera operations, and special effects video edits are identified. Key instances of video objects are selected within the shot boundaries by calculating first quantitative attributes of a first instance of a video object in a first frame and second quantitative attributes of a second instance of the video object in a second frame and calculating a quantitative attribute differential between the first and second instances of the video object. The quantitative attribute differential is compared to a key instance threshold and, if the differential exceeds the threshold, a key instance of the object is selected from the video sequence bounded by the first and second video frames. The calculation of the quantitative attribute differential captures a wide variety of instance-to-instance transitions which can trigger selections of key instances of video objects. For example, a sequence in which the camera zooms in on the video object rapidly results in a size differential between first and second instances of the video object, which alone is sufficient to exceed the threshold. Alternatively, a combination of changes in quantitative attributes for a video object, such as size and shape, might exceed the threshold, even though none of the quantitative attribute changes in isolation would be sufficient to exceed the threshold.
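The following sketch, offered only as an example, shows how a single differential can be triggered either by one large attribute change or by several smaller changes in combination. The relative-change metric, the dictionary representation of an object instance, and the function names are assumptions for illustration and are not taken from the description.

```python
from typing import Dict


def instance_differential(first: Dict[str, float], second: Dict[str, float]) -> float:
    """Assumed metric: sum of relative changes over the attributes of `first`."""
    total = 0.0
    for name, a in first.items():
        b = second[name]
        scale = max(abs(a), abs(b), 1e-9)  # guard against division by zero
        total += abs(a - b) / scale
    return total


def is_key_instance(first: Dict[str, float], second: Dict[str, float],
                    threshold: float) -> bool:
    """True when the differential between two instances exceeds the threshold."""
    return instance_differential(first, second) > threshold


# A rapid zoom: the size change alone exceeds a threshold of 0.5.
zoom_before = {"size": 100.0, "shape": 1.0}
zoom_after = {"size": 400.0, "shape": 1.0}

# Moderate size and shape changes: neither exceeds 0.5 alone, but their sum does.
combined_before = {"size": 100.0, "shape": 1.0}
combined_after = {"size": 130.0, "shape": 1.4}
```

With a threshold of 0.5, both pairs above would be flagged, the first because of size alone and the second only because the size and shape changes are summed.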
Key frames are extracted from the various shots of the video sequence defined by shot boundaries by calculating quantitative attributes of first video frame objects and second video frame objects. For each key frame extraction procedure, a quantitative attribute differential is calculated by comparing the first frame quantitative attributes to the second frame quantitative attributes. The quantitative attribute differential is compared to a key frame threshold to determine if a key frame should be selected for the video sequence bounded by the first and second video frames.
To detect camera operations such as zooming, panning, and tracking, motion histograms are calculated for the video objects of selected first and second video frames of the video sequence. The motion histograms identify the direction of motion vectors of each object and the magnitudes of the vectors. A comparison is performed of the motion histograms for the video objects of the selected video frames to ideal motion histograms, each of which represents a different camera operation. Each calculated motion histogram is compared to an ideal zoom histogram, an ideal tracking histogram, and an ideal panning histogram. If the similarity that is calculated for one of the ideal motion histograms exceeds a predetermined threshold, one of the frames of the video sequence is selected for the purpose of indexing a camera operation sequence.
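As one possible illustration of the histogram comparison, the sketch below builds a magnitude-weighted direction histogram from motion vectors and scores it against ideal histograms by histogram intersection. The eight-bin layout, the intersection measure, and the two example templates ("pan_right" and "zoom") are assumptions; the description does not specify the form of the ideal histograms, and an ideal tracking histogram would have to be supplied in the same way.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

BINS = 8  # assumed number of direction bins, centered on the compass directions


def motion_histogram(vectors: Sequence[Tuple[float, float]]) -> List[float]:
    """Normalized histogram of motion directions, weighted by vector magnitude."""
    hist = [0.0] * BINS
    for dx, dy in vectors:
        magnitude = math.hypot(dx, dy)
        if magnitude == 0.0:
            continue
        angle = math.atan2(dy, dx) % (2 * math.pi)
        index = int((angle + math.pi / BINS) / (2 * math.pi / BINS)) % BINS
        hist[index] += magnitude
    total = sum(hist)
    return [h / total for h in hist] if total else hist


def similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Histogram intersection: 1.0 for identical normalized histograms."""
    return sum(min(x, y) for x, y in zip(a, b))


def classify_camera_operation(hist: Sequence[float],
                              ideals: Dict[str, Sequence[float]],
                              threshold: float) -> Optional[str]:
    """Return the best-matching ideal histogram whose similarity exceeds
    the threshold, or None when no ideal histogram qualifies."""
    best_name, best_score = None, threshold
    for name, ideal in ideals.items():
        score = similarity(hist, ideal)
        if score > best_score:
            best_name, best_score = name, score
    return best_name


# Hypothetical templates: all motion in one direction, or spread evenly.
IDEALS = {
    "pan_right": [1.0] + [0.0] * (BINS - 1),
    "zoom": [1.0 / BINS] * BINS,
}
```

A frame whose motion vectors all point in roughly the same direction would match the single-direction template, whereas vectors radiating in every direction would match the uniform template.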
Special effects video editing, such as wipes, fade-in/fade-out, and dissolve, is also detected using object-based indexing. These special editing effects are not necessarily detected by the shot boundary detection because they are not sharp cuts within the video sequence, but are instead gradual transitions. Special effects edits are detected by comparing first video frame alpha maps to second video frame alpha maps to calculate an alpha map differential which is compared to a special effect video edit threshold. If the threshold is exceeded, an index entry is made to indicate the presence in the sequence of special effects video editing.
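A minimal sketch of an alpha map comparison is given below for illustration. It assumes that each alpha map is a two-dimensional grid of values between 0 and 1 and that the differential is the mean absolute per-pixel difference; neither assumption, nor the function names, comes from the description.

```python
from typing import Sequence

AlphaMap = Sequence[Sequence[float]]


def alpha_map_differential(first: AlphaMap, second: AlphaMap) -> float:
    """Assumed metric: mean absolute per-pixel difference of two alpha maps."""
    if len(first) != len(second) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(first, second)
    ):
        raise ValueError("alpha maps must have the same dimensions")
    pixel_count = sum(len(row) for row in first)
    if pixel_count == 0:
        return 0.0
    total = sum(
        abs(a - b)
        for row_a, row_b in zip(first, second)
        for a, b in zip(row_a, row_b)
    )
    return total / pixel_count


def detect_special_effect(first: AlphaMap, second: AlphaMap,
                          threshold: float) -> bool:
    """True when the alpha map differential exceeds the edit threshold."""
    return alpha_map_differential(first, second) > threshold
```

Under these assumptions, a fade that halves every alpha value yields a differential of 0.5, while a wipe that clears half of the pixels of a fully opaque map yields the same value.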
Object-based indexing advantageously enables object-based video retrieval. A video query which identifies a query object is received and quantitative attributes are calculated for the object. The query object quantitative attributes are compared to quantitative attributes of video objects within the video sequence to select a query match. In a preferred embodiment, only the key instances of objects within the shots are searched in order to conserve processing resources. Alternatively, all instances of the video objects within each shot can be searched.
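The retrieval step might be sketched as follows, purely as an example. The similarity measure (the reciprocal of one plus the Euclidean distance between attribute values), the dictionary keyed by a hypothetical instance identifier, and the function names are all assumptions made for this sketch.

```python
import math
from typing import Dict, List, Tuple


def attribute_similarity(query: Dict[str, float],
                         candidate: Dict[str, float]) -> float:
    """Assumed measure: 1 / (1 + Euclidean distance); 1.0 means identical."""
    distance = math.sqrt(sum((query[k] - candidate[k]) ** 2 for k in query))
    return 1.0 / (1.0 + distance)


def find_query_matches(query: Dict[str, float],
                       key_instances: Dict[str, Dict[str, float]],
                       threshold: float) -> List[Tuple[str, float]]:
    """Search only the stored key instances and return those whose similarity
    to the query object exceeds the threshold, best match first."""
    matches = []
    for instance_id, attributes in key_instances.items():
        score = attribute_similarity(query, attributes)
        if score > threshold:
            matches.append((instance_id, score))
    return sorted(matches, key=lambda item: item[1], reverse=True)
```

Searching every instance of every object, rather than key instances only, would use the same routine with a larger dictionary, at a correspondingly higher processing cost.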
A system for object-based video indexing and retrieval includes a configuration detector which ascertains a configuration of video objects in video frames of a video sequence. The configuration includes the quantity of video objects in a frame, the quantitative characteristics of the objects, and their relative orientation. A configuration comparison processor calculates differentials between quantitative attributes of first frame video objects and second frame video objects, differentials for the quantities of objects between first and second video frames, and differentials in the relative orientation of the video objects.
Multiple threshold detection devices are in communication with the configuration comparison processor to detect shot boundaries, key frames, key instances of objects, camera operations, special effects video edits, and query matches. A shot boundary detector detects object quantity differentials which exceed a shot boundary threshold in a first embodiment and quantitative attribute differentials which exceed a shot boundary threshold in a second embodiment. A key frame selector recognizes object orientation differentials which exceed a key frame threshold. A key instance selector detects differences between quantitative attributes of first instances of video objects and quantitative attributes of second instances of the video objects which exceed a key instance threshold. A camera operation detector is configured to detect similarities between calculated motion histograms of video frames and ideal motion histograms which are in excess of a similarity threshold. A special effects detector is configured to detect differentials between alpha maps which exceed a special effect video edit threshold. Finally, a query match detector detects similarities between quantitative attributes of query objects and quantitative attributes of video objects which exceed a query threshold. An indexing device is responsive to the various threshold detection devices to make an index entry in response to the detection of exceeded thresholds.
An advantage provided by the present invention resides in the ability to refine indexing of video sequences to the level of video objects as opposed to entire video frames. A further advantage of the present invention is that video retrieval can be focused more precisely by allowing object-based searching of video sequences in contrast to prior frame-based searching.