1. Field of the Invention
The present invention relates to video processing, and, in particular, to the detection of transitions, such as cuts, wipes, and dissolves, in sequences of digital video images.
2. Description of the Related Art
Images, or frames, in a digital video sequence are typically represented by arrays of picture elements or pixels, where each pixel is represented by one or more different components. For example, in a monochrome gray-scale image, each pixel is represented by a component whose value corresponds to the intensity of the pixel. In an RGB color format, each pixel is represented by a red component, a green component, and a blue component. Similarly, in a YUV color format, each pixel is represented by an intensity (or luminance) component Y and two color (or chrominance) components U and V. In 24-bit versions of these color formats, each pixel component is represented by an 8-bit value.
A typical video sequence is made up of sets of consecutive frames called shots, where the frames of a given shot correspond to the same basic scene. A shot is an unbroken sequence of frames from one camera. There is currently considerable interest in the parsing of digital video into its constituent shots. This is driven by the rapidly increasing availability of video material, and the need to create indexes for video databases. Parsing is also useful for video editing and compression.
One way to distinguish different shots in a digital video sequence is to analyze histograms corresponding to the video frames. A frame histogram is a representation of the distribution of component values for a frame of video data. For example, a 32-bin histogram may be generated for the 8-bit Y components of a video frame represented in a 24-bit YUV color format, where the first bin indicates how many pixels in the frame have Y values between 0 and 7 inclusive, the second bin indicates how many pixels have Y values between 8 and 15 inclusive, and so on until the thirty-second bin indicates how many pixels have Y values between 248 and 255 inclusive.
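The 32-bin example above can be sketched in a few lines of Python. This is only an illustration: the function name `luma_histogram` and the tiny six-pixel frame are hypothetical, and the sketch assumes the Y values have already been extracted as integers in the range 0 to 255 and that the bin count divides 256 evenly.

```python
from typing import List, Sequence


def luma_histogram(y_values: Sequence[int], num_bins: int = 32) -> List[int]:
    """Count 8-bit Y values into equal-width bins (32 bins gives a width of 8)."""
    bin_width = 256 // num_bins  # assumes num_bins divides 256 evenly
    hist = [0] * num_bins
    for y in y_values:
        hist[y // bin_width] += 1
    return hist


# Hypothetical frame flattened to a list of Y values.
frame_y = [0, 7, 8, 15, 248, 255]
hist = luma_histogram(frame_y)
# Bin 0 covers Y in 0..7, bin 1 covers 8..15, and bin 31 covers 248..255.
```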
Multi-dimensional histograms can also be generated based on combinations of bits from different components. For example, a three-dimensional (3D) histogram can be generated for YUV data using the 4 most significant bits (MSBs) of the Y components and the 3 MSBs of the U and V components, where each bin in the 3D histogram corresponds to the number of pixels in the frame having the same 4 MSBs of Y, the same 3 MSBs of U, and the same 3 MSBs of V.
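One possible way to build the 3D histogram described above is to pack the selected MSBs into a single bin index, as in the following sketch. The function name `yuv_msb_histogram` and the flat bin layout (Y bits highest, then U, then V) are assumptions made for illustration; only the choice of 4, 3, and 3 MSBs comes from the text.

```python
from typing import Iterable, List, Tuple


def yuv_msb_histogram(pixels: Iterable[Tuple[int, int, int]]) -> List[int]:
    """Histogram over the 4 MSBs of Y and the 3 MSBs of U and V (8-bit inputs)."""
    hist = [0] * (16 * 8 * 8)  # 2^4 * 2^3 * 2^3 = 1024 bins
    for y, u, v in pixels:
        # Keep only the most significant bits of each 8-bit component,
        # then pack them into one 10-bit index (assumed layout: Y | U | V).
        index = ((y >> 4) << 6) | ((u >> 5) << 3) | (v >> 5)
        hist[index] += 1
    return hist


# Hypothetical pixels: the first two share all selected MSBs, so they
# fall into the same bin even though their full values differ.
pixels = [(255, 255, 255), (240, 224, 224), (0, 0, 0)]
hist3d = yuv_msb_histogram(pixels)
```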
One general characteristic of video sequences is that, for certain types of histograms, the histograms for frames within a given shot are typically more similar than the histograms for frames in shots corresponding to different scenes. As such, one way to parse digital video into its constituent shots is to look for the transitions between shots by comparing histograms generated for the frames in the video sequence. In general, frames with similar histograms are more likely to correspond to the same scene than frames with dissimilar histograms.
The difference between two histograms can be represented mathematically using metrics called dissimilarity measures. One possible dissimilarity measure $D_{abs}(m,n)$ corresponds to the sum of the absolute values of the differences between corresponding histogram bin values for two frames m and n, which can be represented by Equation (1) as follows:
$$D_{abs}(m,n) \equiv \sum_i \left| h_i(m) - h_i(n) \right| \qquad (1)$$
where $h_i(m)$ is the i-th bin value of the histogram corresponding to frame m and $h_i(n)$ is the i-th bin value of the histogram corresponding to frame n. Another possible dissimilarity measure $D_{sq}(m,n)$ corresponds to the sum of the squares of the differences between corresponding histogram bin values, which can be represented by Equation (2) as follows:
$$D_{sq}(m,n) \equiv \sum_i \left| h_i(m) - h_i(n) \right|^2 \qquad (2)$$
Each of these dissimilarity measures provides a numerical representation of the difference, or dissimilarity, between two frames.
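As a small worked illustration, the two dissimilarity measures can be computed directly from a pair of histograms. The function names `d_abs` and `d_sq` and the three-bin histograms are hypothetical; the sketch assumes both histograms have the same number of bins.

```python
from typing import Sequence


def d_abs(h_m: Sequence[int], h_n: Sequence[int]) -> int:
    """Sum of absolute bin differences, as in Equation (1)."""
    return sum(abs(a - b) for a, b in zip(h_m, h_n))


def d_sq(h_m: Sequence[int], h_n: Sequence[int]) -> int:
    """Sum of squared bin differences, as in Equation (2)."""
    return sum((a - b) ** 2 for a, b in zip(h_m, h_n))


# Hypothetical three-bin histograms for frames m and n.
h_m = [4, 0, 2]
h_n = [1, 3, 2]
abs_value = d_abs(h_m, h_n)  # |4-1| + |0-3| + |2-2| = 6
sq_value = d_sq(h_m, h_n)    # 3^2 + 3^2 + 0^2 = 18
```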
A cut is an abrupt transition between two consecutive shots, in which the last frame of the first shot is followed immediately by the first frame of the second shot. Such abrupt transitions can be detected with good reliability by analyzing histogram differences between each pair of consecutive frames in the video sequence, and identifying those pairs of consecutive frames whose histogram differences exceed a specified (and typically empirically generated) threshold as corresponding to cuts in the video sequence.
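The consecutive-frame thresholding just described might be sketched as follows. The function name `detect_cuts`, the use of the absolute-difference measure, and the threshold value are illustrative assumptions; in practice the threshold would be chosen empirically, as noted above.

```python
from typing import List, Sequence


def d_abs(h_m: Sequence[int], h_n: Sequence[int]) -> int:
    """Sum of absolute bin differences between two histograms."""
    return sum(abs(a - b) for a, b in zip(h_m, h_n))


def detect_cuts(histograms: Sequence[Sequence[int]], threshold: int) -> List[int]:
    """Return each index n where D(n-1, n) exceeds the threshold.

    Index n is then taken to be the first frame of a new shot.
    """
    cuts = []
    for n in range(1, len(histograms)):
        if d_abs(histograms[n - 1], histograms[n]) > threshold:
            cuts.append(n)
    return cuts


# Hypothetical two-bin histograms: frames 0-1 belong to one shot,
# frames 2-3 to another. Consecutive differences are 2, 18, 2.
histograms = [[10, 0], [9, 1], [0, 10], [1, 9]]
cuts = detect_cuts(histograms, threshold=10)
```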
Unlike cuts, gradual transitions in video sequences, such as dissolves, fades, and wipes, cannot be detected with adequate reliability using histogram differences based on consecutive frames. A dissolve is a transition between two shots in which the pixel values of the second shot steadily replace the pixel values of the first shot over a number of transition frames, where the individual pixels of the transition frames are generated by blending corresponding pixels from the two different shots. A blended pixel $p_b$ may be defined by Equation (3) as follows:
$$p_b \equiv (1-f)\,p_1 + f\,p_2, \quad 0 \le f \le 1, \qquad (3)$$
where $p_1$ is the corresponding pixel from the first shot, $p_2$ is the corresponding pixel from the second shot, and f is the blending fraction. As a dissolve progresses from one transition frame to the next, the blending fraction f is increased. As such, the influence of the pixels from the first shot in the blending process gets smaller, while the influence of the pixels from the second shot gets greater, until, at the end of the transition, each pixel corresponds entirely to a pixel from the second shot.
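The blending of Equation (3) can be illustrated with a short sketch. Several simplifications here are assumptions rather than part of the description: the blending fraction is ramped linearly, each shot is represented by a single static frame (real shots would contribute a new frame at every step), and the names `blend_pixel` and `dissolve` are hypothetical.

```python
from typing import List, Sequence


def blend_pixel(p1: float, p2: float, f: float) -> float:
    """Blend two pixel values as in Equation (3)."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("blending fraction f must lie in [0, 1]")
    return (1.0 - f) * p1 + f * p2


def dissolve(frame1: Sequence[float], frame2: Sequence[float],
             num_frames: int) -> List[List[float]]:
    """Generate transition frames with f rising linearly to 1 (assumed ramp)."""
    frames = []
    for k in range(1, num_frames + 1):
        f = k / num_frames
        frames.append([blend_pixel(a, b, f) for a, b in zip(frame1, frame2)])
    return frames


# Hypothetical two-pixel frames, one from each shot.
shot1_frame = [100, 200]
shot2_frame = [0, 100]
transition = dissolve(shot1_frame, shot2_frame, num_frames=4)
# f takes the values 0.25, 0.5, 0.75, 1.0; the last frame equals shot2_frame.
```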
A fade is a transition in which the pixels of a shot are gradually changed over a number of transition frames until the image is all one color. For example, in a fade to black, the intensities or luminances of the pixels are steadily decreased until the image is all black. A fade can be viewed as a particular type of dissolve in which the second shot corresponds to a solid color.
A wipe is a transition between two shots in which the pixels of the second shot steadily replace the pixels of the first shot over a number of transition frames, where the individual pixels of the transition frames are generated by selecting pixel values from either the first shot or the second shot. As a wipe progresses from one transition frame to the next, the number of pixel values selected from the first shot decreases, while the number of pixel values from the second shot increases, until, at the end of the transition, all of the pixels are from the second shot.
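In contrast to the blending used in a dissolve, a wipe selects each pixel from one shot or the other. The sketch below assumes a left-to-right wipe with a straight vertical boundary, which is only one of many possible wipe patterns; the name `wipe_frame` and the small frames are hypothetical.

```python
from typing import List, Sequence


def wipe_frame(frame1: Sequence[Sequence[int]], frame2: Sequence[Sequence[int]],
               boundary: int) -> List[List[int]]:
    """Take columns left of the boundary from frame2 and the rest from frame1."""
    return [list(row2[:boundary]) + list(row1[boundary:])
            for row1, row2 in zip(frame1, frame2)]


# Hypothetical 2x4 frames: the first shot is all 1s, the second all 9s.
first = [[1, 1, 1, 1], [1, 1, 1, 1]]
second = [[9, 9, 9, 9], [9, 9, 9, 9]]

# Moving the boundary one column per transition frame increases the
# number of pixels taken from the second shot until none remain from the first.
wipe = [wipe_frame(first, second, b) for b in range(1, 5)]
```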
In gradual transitions, such as dissolves, fades, and wipes, the histogram differences between consecutive transition frames, as measured by dissimilarity measures such as those defined in Equations (1) and (2), might not be significantly different from the histogram differences between consecutive frames within a given shot. As such, using histogram differences based on consecutive frames will not provide a reliable measure for detecting gradual transitions in video sequences.
To detect gradual transitions, Zhang, et al., "Automatic partitioning of full-motion video," Multimedia Systems, Vol. 1, No. 1, pp. 10-28 (1993), introduce a twin-comparison method that uses two thresholds. This method first evaluates the dissimilarity between adjacent frames and uses the lower of the two thresholds to indicate the potential starting frame s of a gradual transition. It then evaluates D(s,s+k) for each frame s+k for which D(s+k-1,s+k) exceeds this lower threshold. A gradual transition is identified when D(s,s+k) exceeds the higher of the two thresholds (i.e., a threshold similar to the one used for an abrupt transition). Despite some refinements to allow for variations between adjacent frames, this method is confounded by motion. Camera motion can be identified by motion analysis, but the impact of object motion is much more difficult to account for.
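The following is a minimal sketch of the two-threshold logic as summarized in the preceding paragraph. It is not the implementation of Zhang, et al.: it omits their refinements for variations between adjacent frames, and the function names, the use of the absolute-difference measure, and the threshold values are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def d_abs(h_m: Sequence[int], h_n: Sequence[int]) -> int:
    """Sum of absolute bin differences between two histograms."""
    return sum(abs(a - b) for a, b in zip(h_m, h_n))


def twin_comparison(histograms: Sequence[Sequence[int]],
                    t_low: int, t_high: int) -> List[Tuple[int, int]]:
    """Return (s, s+k) pairs where the accumulated difference exceeds t_high."""
    transitions = []
    n = 1
    while n < len(histograms):
        if d_abs(histograms[n - 1], histograms[n]) > t_low:
            s, k = n - 1, 1  # potential starting frame s
            # Keep extending while consecutive differences stay above t_low.
            while (s + k < len(histograms)
                   and d_abs(histograms[s + k - 1], histograms[s + k]) > t_low):
                if d_abs(histograms[s], histograms[s + k]) > t_high:
                    transitions.append((s, s + k))
                    break
                k += 1
            n = s + k + 1
        else:
            n += 1
    return transitions


# Hypothetical two-bin histograms for a slow change between frames 1 and 6.
# Each consecutive difference is only 4, but the differences accumulate.
histograms = [[10, 0], [10, 0], [8, 2], [6, 4], [4, 6], [2, 8], [0, 10], [0, 10]]
found = twin_comparison(histograms, t_low=3, t_high=15)
```

In this sketch, an abrupt cut whose single consecutive difference exceeds the higher threshold would be reported as a pair of adjacent frames (s, s+1).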
Yeo, et al., "Rapid Scene Analysis on Compressed Video," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 5, No. 6, pp. 533-544 (1995), detect transitions by evaluating D(n-w,n) for all n, where the fixed timing window w is selected to be significantly larger than the number of frames in a transition. This method is resistant to motion, but requires some knowledge of expected transition lengths.
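The fixed-window comparison D(n-w,n) can be illustrated as follows. This sketch shows only the windowed difference itself and is not the method of Yeo, et al. in full; the function names, the absolute-difference measure, and the example histograms are hypothetical.

```python
from typing import List, Sequence, Tuple


def d_abs(h_m: Sequence[int], h_n: Sequence[int]) -> int:
    """Sum of absolute bin differences between two histograms."""
    return sum(abs(a - b) for a, b in zip(h_m, h_n))


def windowed_dissimilarity(histograms: Sequence[Sequence[int]],
                           w: int) -> List[Tuple[int, int]]:
    """Return (n, D(n-w, n)) for every frame n that has a frame w steps earlier."""
    return [(n, d_abs(histograms[n - w], histograms[n]))
            for n in range(w, len(histograms))]


# Hypothetical two-bin histograms: a gradual change spans frames 2 to 5,
# so a window of w = 6 is longer than the transition.
histograms = [[10, 0], [10, 0], [8, 2], [6, 4], [4, 6], [2, 8], [0, 10], [0, 10]]
wide = windowed_dissimilarity(histograms, w=6)
narrow = windowed_dissimilarity(histograms, w=1)
# With w = 6, frames on either side of the transition are compared directly,
# whereas with w = 1 no single difference exceeds 4.
```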
Shahraray, "Scene change detection and content-based sampling of video sequences," Digital Video Compression: Algorithms and Technologies 1995, Proc. SPIE 2419, pp. 2-13 (1995), applies a recursive temporal filter to D(n-w,n), where w may exceed 1. His analysis includes order statistics and a motion indicator.
Alattar, "Detecting and Compressing Dissolve Regions in Video Sequences with a DVI Multimedia Image Compression Algorithm," Proceedings of the IEEE International Symposium on Circuits and Systems, May 3-6, 1993, Chicago, Ill., pp. 13-16, uses the variance of pixel values to identify dissolves. This method is also resistant to motion but does not appear to apply to other types of gradual transitions, such as wipes.