1. Field of Invention
This invention relates to video indexing, archiving, editing and production, and more particularly, to analyzing video content using video segmentation techniques.
2. Description of Related Art
Much information in today's world is captured on video. However, the volume of video sources makes finding a specific video sequence difficult. The time-dependent nature of video also makes it a very difficult medium to manage. Thus, to be a useful information source, the video should be indexed. Indexing can be used to identify shots, as well as the sharp breaks and gradual transitions between shots. However, much of the vast quantity of video containing valuable information remains unindexed.
At least part of the reason is that many current indexing systems require an operator to view the entire video sequence and to manually assign an index to each scene in the video. That is, an operator must view the entire video through a sequential scan and assign index values to each frame of the video. This process is slow and unreliable, especially compared to systems that are used for searching text-based data. Thus, the effective use of video is limited by the lack of a viable system that enables easy and accurate organization and retrieval of information.
Current systems for indexing video can detect sharp breaks in a video sequence, but are generally not effective for detecting gradual transitions between video sequences. These systems use techniques for computing feature values within each frame in a video sequence. The simplest technique is to count the number of pixels that change in value more than some threshold, such as described in Zhang et al., "Automatic Partitioning of Full-motion Video," Multimedia Systems, 1993, Vol. 1, No. 1, pp. 10-28 (hereafter Zhang). Zhang used this technique on smoothed images, and obtained very good results for thresholds tailored to each specific video. Another technique compares images based on statistical properties. For example, Kasturi and Jain, "Dynamic Vision" in Computer Vision: Principles, IEEE Computer Society Press, Washington, D.C., 1991, discloses computing a feature value difference based on the mean and standard deviation of gray levels in regions of the frames.
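By way of illustration only, the two frame-comparison techniques described above can be sketched as follows. The sketch assumes each frame is a list of rows of integer gray levels. The function names, the frame representation and the threshold value are hypothetical and are not taken from Zhang or from Kasturi and Jain. Image smoothing and region-by-region comparison are omitted.

```python
from statistics import mean, pstdev


def _flatten(frame):
    """Return all gray levels of a frame (a list of rows) as one flat list."""
    return [pixel for row in frame for pixel in row]


def changed_pixel_fraction(frame_a, frame_b, pixel_threshold):
    """Fraction of pixels whose gray level changes by more than pixel_threshold.

    Assumption: both frames have the same, non-zero number of pixels.
    """
    flat_a, flat_b = _flatten(frame_a), _flatten(frame_b)
    if not flat_a or len(flat_a) != len(flat_b):
        raise ValueError("frames must be non-empty and of equal size")
    changed = sum(
        1 for a, b in zip(flat_a, flat_b) if abs(a - b) > pixel_threshold
    )
    return changed / len(flat_a)


def gray_level_stats(frame):
    """Mean and (population) standard deviation of the gray levels of a frame."""
    flat = _flatten(frame)
    return mean(flat), pstdev(flat)


# Two hypothetical 2x2 frames: the bottom row changes sharply.
frame_a = [[10, 10], [10, 10]]
frame_b = [[10, 10], [200, 200]]
fraction = changed_pixel_fraction(frame_a, frame_b, pixel_threshold=20)
```

In this sketch, half of the pixels change by more than the threshold, so `fraction` is 0.5. Whether that value indicates a shot boundary still depends on a second, manually selected threshold on the fraction itself.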
The most widely used technique for detecting shot boundaries is based on color or gray-level histograms. In this technique, if the bin-wise difference between histograms for adjacent frames exceeds a threshold, a shot boundary is assumed. Zhang used this technique with two thresholds in order to detect gradual transitions.
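By way of illustration only, a histogram-based detector with two thresholds can be sketched as follows. This is a simplified sketch and is not Zhang's published procedure: it assumes 8-bit gray-level frames stored as lists of rows, declares a sharp break when a single frame-to-frame difference reaches the higher threshold, and declares a gradual transition when a run of consecutive differences at or above the lower threshold accumulates to the higher threshold. All names, the bin count and the threshold values are hypothetical.

```python
def gray_histogram(frame, bins=16, levels=256):
    """Count the pixels of a frame falling into each of `bins` gray-level bins."""
    hist = [0] * bins
    for row in frame:
        for pixel in row:
            hist[pixel * bins // levels] += 1
    return hist


def histogram_difference(hist_a, hist_b):
    """Bin-wise difference: sum of absolute differences of corresponding bins."""
    return sum(abs(a - b) for a, b in zip(hist_a, hist_b))


def detect_boundaries(frames, high, low):
    """Return (cuts, graduals) for a list of frames.

    cuts     : indices i such that a sharp break lies between frames i-1 and i
    graduals : (first_frame, last_frame) pairs spanning a gradual transition
    """
    hists = [gray_histogram(f) for f in frames]
    cuts, graduals = [], []
    start, accumulated = None, 0
    for i in range(1, len(hists)):
        diff = histogram_difference(hists[i - 1], hists[i])
        if diff >= high:
            # A single large jump is treated as a sharp break.
            cuts.append(i)
            start, accumulated = None, 0
        elif diff >= low:
            # A small but noticeable change may belong to a gradual transition.
            if start is None:
                start = i - 1
            accumulated += diff
        else:
            # The run has ended; keep it only if enough change accumulated.
            if start is not None and accumulated >= high:
                graduals.append((start, i - 1))
            start, accumulated = None, 0
    if start is not None and accumulated >= high:
        graduals.append((start, len(frames) - 1))
    return cuts, graduals


# Hypothetical one-row frames: a sharp break, then a slow change.
cut_clip = [[[0] * 4], [[0] * 4], [[255] * 4], [[255] * 4]]
fade_clip = [[[0, 0, 0, 0]], [[0, 0, 0, 255]], [[0, 0, 255, 255]],
             [[0, 255, 255, 255]], [[255] * 4], [[255] * 4]]
cut_result = detect_boundaries(cut_clip, high=6, low=2)
fade_result = detect_boundaries(fade_clip, high=6, low=2)
```

With these toy clips, `cut_result` is `([2], [])` and `fade_result` is `([], [(0, 4)])`. Both thresholds were chosen by hand to suit these particular frames.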
Arman et al., "Image Processing on Encoded Video Sequences," Multimedia Systems, 1994, Vol. 1, No. 6, pp. 211-219, computed frame-to-frame differences without decompressing the images, using differences between the discrete cosine transform (DCT) coefficients of adjacent frames. Zabih et al., "A Feature-based Algorithm for Detecting and Classifying Scene Breaks," Proc. ACM Multimedia 95, San Francisco, Calif., November 1995, pp. 189-200, discloses comparing the number and position of edges in adjacent frames. Shot boundaries were detected when the percentage of edges entering and exiting between the adjacent frames exceeded a threshold. Dissolves and fades were indexed by looking at the relative values of the entering and exiting edge percentages. Finally, Phillips and Wolf, "Video Segmentation Technologies for News," in Multimedia Storage and Archiving Systems, Proc. SPIE 2916, 1996, pp. 243-251, discloses computing the sums of the magnitudes of the motion vectors within an image, and using these sums alone or with a histogram difference to detect shot boundaries.
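By way of illustration only, the entering and exiting edge percentages can be sketched as follows. The sketch assumes that a binary edge map (a list of rows of 0 and 1 values) has already been computed for each frame by some edge detector, and it compares edge positions exactly. It is a simplification and does not reproduce the method of Zabih et al.; in particular, it makes no allowance for camera or object motion. All names and the threshold value are hypothetical.

```python
def edge_positions(edge_map):
    """Set of (row, column) positions marked as edges in a binary edge map."""
    return {
        (r, c)
        for r, row in enumerate(edge_map)
        for c, value in enumerate(row)
        if value
    }


def edge_change_fractions(prev_edges, curr_edges):
    """Return (entering, exiting) edge fractions between two adjacent frames.

    entering: share of current-frame edges that were absent in the previous frame
    exiting : share of previous-frame edges that are absent in the current frame
    """
    prev, curr = edge_positions(prev_edges), edge_positions(curr_edges)
    entering = len(curr - prev) / len(curr) if curr else 0.0
    exiting = len(prev - curr) / len(prev) if prev else 0.0
    return entering, exiting


def is_shot_boundary(prev_edges, curr_edges, threshold):
    """Flag a boundary when either fraction exceeds a manually chosen threshold."""
    return max(edge_change_fractions(prev_edges, curr_edges)) > threshold


# Hypothetical one-row edge maps for two adjacent frames.
prev_map = [[1, 1, 0, 0]]
curr_map = [[0, 1, 1, 1]]
entering, exiting = edge_change_fractions(prev_map, curr_map)
boundary = is_shot_boundary(prev_map, curr_map, threshold=0.6)
```

Here two of the three current edges are new and one of the two previous edges has disappeared, so the entering fraction exceeds the exiting fraction. Comparing the two fractions over a run of frames is what allows fades and dissolves to be distinguished in the approach described above.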
One drawback of the above-described techniques is the need for manual threshold selection. Even when done very carefully, this step can introduce errors because of differences between videos. Thus, a threshold selected for one video may not be appropriate for another video. Further, establishing a threshold is even harder for gradual transitions, because the feature value differences between frames may be very small.
Another problem with current techniques involves feature selection. Many features, such as brightness, work well for classifying sharp breaks or cuts. However, the same features are often difficult to apply to identifying gradual transitions.
Finally, many video segments could be more accurately identified by applying multiple features to the video segment. However, current techniques do not generally allow use of multiple features.