Efficient representation of visual content of video streams has emerged as the primary functionality in distributed multimedia applications, including video-on-demand, interactive video, content-based search and manipulation, and automatic analysis of surveillance video. A video stream is a temporally evolving medium where content changes occur due to camera shot changes, special effects, and object/camera motion within the video sequence. Temporal video segmentation constitutes the first step in content-based video analysis, and refers to breaking the input video sequence into multiple temporal units (segments) based upon certain uniformity criteria.
Automatic temporal segmentation of video sequences has previously centered on the detection of individual camera shots, where each shot contains the temporal sequence of frames generated during a single operation of the camera. Shot detection is performed by computing frame-to-frame similarity metrics to distinguish intershot variations, which are introduced by transitions from one camera shot to the next, from intrashot variations, which are introduced by object and/or camera movement as well as by changes in illumination. Such methods are collectively known as video shot boundary detection (SBD). Various SBD methods for temporal video segmentation have been developed. These methods can be broadly divided into three classes, each employing different frame-to-frame similarity metrics: (1) pixel/block comparison methods, (2) intensity/color histogram comparison methods, and (3) methods which operate only on compressed, i.e., MPEG encoded, video sequences (see K. R. Rao and J. J. Hwang, Techniques and Standards for Image, Video and Audio Coding, Chapters 10-12, Prentice-Hall, N.J., 1996).
The pixel-based comparison methods detect dissimilarities between two video frames by comparing the differences in intensity values of corresponding pixels in the two frames. The number of changed pixels is counted, and a camera shot boundary is declared if the percentage of the total number of pixels changed exceeds a certain threshold value (see H. J. Zhang, A. Kankanhalli and S. W. Smoliar, "Automatic partitioning of full-motion video," ACM/Springer Multimedia Systems, Vol. 1(1), pp. 10-28, 1993). This type of method can produce numerous false shot boundaries due to slight camera movement, e.g., pan or zoom, and/or object movement. Additionally, the proper threshold value is a function of video content and, consequently, requires trial-and-error adjustment to achieve optimum performance for any given video sequence.
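The pixel-based comparison described above can be illustrated with a minimal sketch. It assumes grayscale frames given as equally sized nested lists of intensities; the function names and the default values of `pixel_delta` and `fraction_threshold` are hypothetical choices for illustration and are not taken from the cited work.

```python
def changed_pixel_fraction(frame_a, frame_b, pixel_delta=20):
    """Return the fraction of corresponding pixels whose intensity
    differs by more than pixel_delta (an assumed per-pixel tolerance)."""
    total = 0
    changed = 0
    for row_a, row_b in zip(frame_a, frame_b):
        for a, b in zip(row_a, row_b):
            total += 1
            if abs(a - b) > pixel_delta:
                changed += 1
    return changed / total if total else 0.0


def is_shot_boundary_pixel(frame_a, frame_b, pixel_delta=20,
                           fraction_threshold=0.5):
    """Declare a boundary when the changed fraction exceeds the
    (content-dependent, assumed) fraction_threshold."""
    fraction = changed_pixel_fraction(frame_a, frame_b, pixel_delta)
    return fraction > fraction_threshold
```

Because every pixel is compared in place, a small pan or zoom can change most pixel values even though the shot has not changed, which is the source of the false boundaries noted above.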
The use of intensity/color histograms for frame content comparison is more robust to noise and object/camera motion, since the histogram takes into account only global intensity/color characteristics of each frame. With this method, a shot boundary is detected if the dissimilarity between the histograms of two adjacent frames is greater than a pre-specified threshold value (see H. J. Zhang, A. Kankanhalli and S. W. Smoliar, "Automatic partitioning of full-motion video," ACM/Springer Multimedia Systems, Vol. 1(1), pp. 10-28, 1993). As with the pixel-based comparison method, selecting a small threshold value will lead to false detections of shot boundaries due to object and/or camera motion within the video sequence. Additionally, if the adjacent shots have similar global color characteristics but different content, the histogram dissimilarity will be small and the shot boundary will go undetected.
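A minimal sketch of histogram comparison follows. It assumes grayscale frames of equal size given as nested lists of intensities in the range 0-255, and uses a normalized sum of absolute bin differences as the dissimilarity measure; that particular measure, the bin count, and the function names are assumptions made for illustration.

```python
def intensity_histogram(frame, bins=16, max_value=256):
    """Count pixels per intensity bin; spatial position is discarded."""
    hist = [0] * bins
    width = max_value / bins
    for row in frame:
        for value in row:
            hist[min(int(value / width), bins - 1)] += 1
    return hist


def histogram_dissimilarity(frame_a, frame_b, bins=16):
    """Normalized L1 distance between the two histograms, in [0, 1].
    Assumes both frames contain the same number of pixels."""
    hist_a = intensity_histogram(frame_a, bins)
    hist_b = intensity_histogram(frame_b, bins)
    pixel_count = sum(hist_a)
    if pixel_count == 0:
        return 0.0
    diff = sum(abs(x - y) for x, y in zip(hist_a, hist_b))
    return diff / (2 * pixel_count)
```

For example, the frames `[[0, 255], [0, 255]]` and `[[255, 0], [255, 0]]` have different spatial content but identical histograms, so the dissimilarity is zero and no threshold would flag a boundary between them. This is the missed-detection case described above.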
Temporal segmentation methods have also been developed for use with MPEG encoded video sequences (see F. Arman, A. Hsu and M. Y. Chiu, "Image Processing on Compressed Data for Large Video Databases," Proceedings of the 1st ACM International Conference on Multimedia, pp. 267-272, 1993). Temporal segmentation methods which work on this form of video data analyze the Discrete Cosine Transform (DCT) coefficients of the compressed data to find highly dissimilar consecutive frames which correspond to camera breaks. Again, content-dependent threshold values are required to properly identify the dissimilar frames in the sequence that are associated with camera shot boundaries. Additionally, numerous applications require input directly from a video source (tape and/or camera), or from video sequences which are stored in different formats, such as QuickTime, SGImovie, and AVI. For these sequences, methods which work only on MPEG compressed video data are not suitable, as they would require encoding the video data into an MPEG format prior to SBD. Furthermore, the quality of MPEG encoded data can vary greatly, thus causing the temporal segmentation from such encoded video data to be a function of the encoding as well as the content.
The fundamental drawback of the hereinabove described methods is that they do not allow for fully automatic processing based upon the content of an arbitrary input video, i.e., they are not truly domain independent. While the assumption of domain independence is valid for computation of the frame similarity metrics, it clearly does not apply to the decision criteria, particularly the selection of the threshold values. Reported studies (see D. C. Coll and G. K. Choma, "Image Activity Characteristics in Broadcast Television," IEEE Transactions on Communications, pp. 1201-1206, Oct. 1976) on the statistical behavior of video frame differences clearly show that a threshold value that is appropriate for one type of video content will not yield acceptable results for another type of video content.
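The content dependence of a fixed threshold can be illustrated with a small sketch. The two frame-difference sequences below are synthetic values invented purely for illustration (they are not measurements from any cited study), each constructed with a single true cut at index 2; the function name and the threshold values are likewise hypothetical.

```python
def detect_boundaries(differences, threshold):
    """Return the indices whose frame-difference value exceeds threshold."""
    return [i for i, d in enumerate(differences) if d > threshold]


# Synthetic, illustrative frame-difference values (assumed, not measured).
# Both sequences contain one true cut, at index 2.
low_activity = [0.02, 0.03, 0.30, 0.02, 0.01]    # e.g., a static scene
high_activity = [0.35, 0.40, 0.90, 0.38, 0.42]   # e.g., fast motion

# A threshold tuned for the low-activity content over-detects on the other.
print(detect_boundaries(low_activity, 0.25))
print(detect_boundaries(high_activity, 0.25))

# A threshold tuned for the high-activity content misses the quiet cut.
print(detect_boundaries(low_activity, 0.60))
print(detect_boundaries(high_activity, 0.60))
```

With these assumed values, no single threshold isolates the true cut in both sequences, which is the limitation the paragraph above attributes to fixed decision criteria.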
Another drawback of the hereinabove described methods is that they are fundamentally designed for the identification of individual camera shots, i.e., temporal content changes between adjacent frames. Complete content-based temporal segmentation of video sequences must also include identification of temporal segments associated with significant content changes within shots, as well as the temporal segments associated with video editing effects, i.e., fade, dissolve, and uniform intensity segments. Methods have been developed to specifically detect fade (U.S. Pat. No. 5,245,436) and dissolve (U.S. Pat. No. 5,283,645) segments in video sequences, but when any of the hereinabove described methods are modified in an attempt to detect the total set of possible temporal segments, their performance is compromised. Such modifications commonly require additional content-dependent thresholds, each of which must be established for the specific video content before optimum performance can be achieved.
Therefore, there is a need for a method and system for performing accurate and automatic content-based temporal segmentation of video sequences.