In the art of digital video processing, a sequential group of video frames that appears to be captured from the same camera in a continuous or static fashion is known as a shot. Shot boundary detection is the process of identifying the frames that are at the boundaries of the individual shots into which a particular video sequence is segmented. Once identified by their boundaries, shots are usable for a number of practical video content retrieval and representation applications, including video storyboarding, video content searching and browsing, video analysis, and other applications.
Fundamentally, shot boundary estimation involves determining a level of dissimilarity between the content of adjacent frames in a video sequence and, in the event that the determined level of dissimilarity exceeds a threshold, identifying the corresponding frames as the ending and beginning frames of respective shots.
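By way of illustration only, the thresholding principle described above can be sketched as follows. The frame representation (a flat sequence of gray levels), the function names `mean_abs_difference` and `find_shot_boundaries`, and the choice of mean absolute pixel difference as the dissimilarity measure are all assumptions made for this sketch and are not drawn from any of the references discussed herein.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical frame representation: a flat sequence of 8-bit gray levels.
Frame = Sequence[int]


def mean_abs_difference(a: Frame, b: Frame) -> float:
    """Illustrative dissimilarity: mean absolute per-pixel difference."""
    if len(a) != len(b):
        raise ValueError("frames must have the same number of pixels")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def find_shot_boundaries(
    frames: Sequence[Frame],
    threshold: float,
    dissimilarity: Callable[[Frame, Frame], float] = mean_abs_difference,
) -> List[Tuple[int, int]]:
    """Return (ending_frame, beginning_frame) index pairs for each boundary.

    A boundary is reported wherever the dissimilarity between adjacent
    frames exceeds the threshold.
    """
    boundaries = []
    for i in range(len(frames) - 1):
        if dissimilarity(frames[i], frames[i + 1]) > threshold:
            boundaries.append((i, i + 1))
    return boundaries


# Two dark frames followed by two bright frames: one boundary is expected.
example_frames = [
    [10, 10, 10, 10],
    [12, 11, 10, 9],
    [200, 210, 190, 205],
    [201, 209, 191, 204],
]
example_boundaries = find_shot_boundaries(example_frames, threshold=50.0)
```

In this sketch frame 1 is reported as the ending frame of the first shot and frame 2 as the beginning frame of the second. Any other dissimilarity measure can be substituted through the `dissimilarity` parameter.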
Methods for estimating similarities between images are known. For example, U.S. Pat. No. 6,373,979 to Wang discloses a system and method for determining a level of similarity among more than one image. Anticipated spatial characteristics of an image are used for automatically identifying segments within the image and for identifying weights to be added to the color characteristics associated with the identified segments. To determine similarity, comparisons are made between weighted color characteristics of corresponding segments of different images. The identified segments have attributes such as size, position and number which are based on the anticipated spatial characteristics of the image. The anticipated spatial characteristics include, among other things, differences in image characteristics that are anticipated at relative positions within the image.
Methods of segmenting video for classification are also known. For example, U.S. Patent Application Publication No. US2004/0170321 to Gong et al. discloses a technique for video segmentation, classification and summarization based on singular value decomposition. Frames of an input video sequence are represented by vectors composed of concatenated histograms descriptive of the spatial distributions of colors within the video frames. The singular value decomposition maps these vectors into a refined feature space. In the refined feature space produced by the singular value decomposition, a metric is used to measure the amount of information contained in each video shot of the video sequence. The most static video shot is defined as an information unit and the content value computed from the shot is used as a threshold to cluster the remaining frames. The clustered frames are displayed using a set of static keyframes. The video segmentation technique relies on the distance between the frames in the refined feature space to calculate the similarity between frames in the video sequence. The video sequence is segmented based on the values of the calculated similarities. Average video attribute values in each segment are used in classifying the segments.
U.S. Pat. No. 6,195,458 and U.S. Patent Application Publication No. US2001/0005430 to Warnick et al. disclose methods for content-based temporal segmentation of video. A plurality of type-specific individual temporal segments within a video sequence is identified using a plurality of type-specific detectors. The identified type-specific individual temporal segments are then analyzed and refined. A list of locations within the video sequence of the identified type-specific individual temporal segments is output. A frame-to-frame color histogram difference metric is used to measure color similarities in adjacent frames. This is supplemented with a pixel difference metric to measure spatial similarity of adjacent frames. These processes are performed in parallel and the results are used to determine shot boundaries.
U.S. Pat. No. 5,911,008 to Niikura et al. discloses a scheme for estimating shot boundaries in compressed video data at high speed and accuracy. A predictive-picture (P-picture) change is calculated from a P-picture sequence in the input video data which is compressed by an inter-frame/inter-field forward direction prediction coding scheme, according to coded data contained in the P-picture sequence. An intra-picture (I-picture) change is calculated from an I-picture sequence in the input video data which is compressed by an intra-frame/intra-field coding scheme. A shot boundary is detected by evaluating both the P-picture change and the I-picture change.
U.S. Pat. No. 6,393,054 to Altunbasak et al. discloses a system for estimating a shot boundary in compressed video data without decompression. The content differences between frames are compared with thresholds to detect sharp or gradual shot boundaries. Keyframes are extracted from detected shots using an iterative algorithm.
While, as discussed above, there are known techniques for automatically estimating shot boundaries in a digital video sequence, improvements in accuracy and efficiency are desired. For example, a common shortfall with known methods for shot boundary estimation is that the chosen comparison technique for measuring dissimilarity between two frames does not accurately reflect the content differences of the frames. While known color comparison techniques, such as bin-to-bin absolute difference, chi-square histogram difference and the average color method proposed by Hafner et al. in the publication entitled “Efficient Color Histogram Indexing for Quadratic Form Distance Functions”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 17, 1995, are sufficient for some applications, improvements are desired for more closely correlating dissimilarity measurements with true content differences.
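For reference, two of the histogram comparison techniques named above can be sketched as follows. This is a minimal sketch under stated assumptions: the gray-level histogram helper, the bin count, and the particular chi-square form used here (squared bin difference divided by the bin sum, with empty bins skipped) are illustrative choices, and several variants of the chi-square difference exist.

```python
from typing import List, Sequence


def gray_histogram(pixels: Sequence[int], bins: int = 4) -> List[int]:
    """Count 8-bit gray levels into equally wide bins (hypothetical helper)."""
    hist = [0] * bins
    for p in pixels:
        hist[p * bins // 256] += 1
    return hist


def bin_to_bin_difference(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Sum of absolute differences between corresponding bins."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def chi_square_difference(h1: Sequence[float], h2: Sequence[float]) -> float:
    """One common chi-square form: sum of (a - b)^2 / (a + b) over non-empty bins."""
    total = 0.0
    for a, b in zip(h1, h2):
        if a + b > 0:
            total += (a - b) ** 2 / (a + b)
    return total


# All pixels in the darkest bin, the neighbouring bin, and the brightest bin.
dark = [4, 0, 0, 0]
slightly_brighter = [0, 4, 0, 0]
very_bright = [0, 0, 0, 4]

small_shift = bin_to_bin_difference(dark, slightly_brighter)
large_shift = bin_to_bin_difference(dark, very_bright)
```

In this example `small_shift` and `large_shift` are equal, because a bin-to-bin measure only compares corresponding bins and takes no account of how far apart the occupied bins are. This is one commonly noted way in which such a measure can diverge from the perceived content difference between two frames.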
Known color histogram difference comparison methods have also suffered as a result of inappropriate frame dissimilarity thresholds being selected during shot boundary identification. For example, selection of a single universal threshold tends to limit the effectiveness of shot boundary estimation to particular types of video sequence content (e.g., news, animation, or drama), because different content types inherently differ in the dissimilarity threshold levels that define respective shot boundaries. A related problem when using a universal threshold is that even within a video sequence of a particular type, content between shots can vary significantly. In order to address this problem, it has been proposed to continuously adapt the dissimilarity threshold frame by frame on the basis of the average inter-frame dissimilarity of frame pairs in a small window local to each frame. However, any accuracy gains from this simple adaptive approach are often lost due to fluctuating noise in a video sequence, or the side-effects of video compression.
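The windowed adaptive threshold described above might be sketched as follows. The window half-width, the multiplicative `scale` factor, and the exclusion of the frame pair under test from its own local average are assumptions made for illustration; the proposals referred to above do not necessarily use these particular choices.

```python
from typing import List, Sequence


def adaptive_boundaries(
    diffs: Sequence[float], half_window: int = 2, scale: float = 3.0
) -> List[int]:
    """Flag index i when diffs[i] exceeds scale times the local mean.

    diffs[i] is the dissimilarity between frame i and frame i + 1.
    The local mean is taken over up to half_window neighbours on each
    side, excluding diffs[i] itself (an assumption of this sketch).
    """
    boundaries = []
    for i, d in enumerate(diffs):
        lo = max(0, i - half_window)
        hi = min(len(diffs), i + half_window + 1)
        neighbours = [diffs[j] for j in range(lo, hi) if j != i]
        if not neighbours:
            continue  # a single value has no local context to compare with
        local_mean = sum(neighbours) / len(neighbours)
        if d > scale * local_mean:
            boundaries.append(i)
    return boundaries


# Quiet content with one clear jump, and busy content with no real jump.
quiet_with_cut = [1, 1, 1, 10, 1, 1, 1]
busy_no_cut = [8, 9, 8, 10, 9, 8, 9]

quiet_result = adaptive_boundaries(quiet_with_cut)
busy_result = adaptive_boundaries(busy_no_cut)
```

Here the jump in the quiet sequence is flagged while the busy sequence yields no boundary, whereas a single fixed threshold of, say, 5 would flag every frame pair in the busy sequence. Because the threshold follows the local mean directly, isolated noise spikes in an otherwise quiet window can still be flagged.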
Abrupt special effects that cause luminance anomalies in certain frames, such as those due to explosions and flashbulbs, are also known to cause false estimation of shot boundaries, particularly when color comparison methods are employed. In the publication entitled “A New Shot Boundary Detection Algorithm” authored by Zhang et al., Proceedings of 2nd IEEE Pacific-Rim Conference on Multimedia (PCM 2001), a flashlight detector is integrated into a twin histogram shot boundary estimation algorithm. Flash effects are detected by finding the ratio between the average intensity difference of adjacent frames and the average intensity difference of a window of frames before and after the current frame. However, the flashlight detector tends to yield inconsistent results in situations where either sharp intensity fluctuations exist in a short sequence of frames, or a special effect lasts for several frames (e.g., long explosions, fireworks, and quick consecutive flashes).
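One possible reading of the ratio test described above is sketched below. It is not a reproduction of the detector of Zhang et al.; the function name `flash_ratio`, the use of per-frame mean intensities as input, the window length, and the small constant `eps` that guards against division by zero are all assumptions made for this illustration.

```python
from typing import Sequence


def flash_ratio(
    intensity: Sequence[float], i: int, window: int = 2, eps: float = 1e-6
) -> float:
    """Ratio of the adjacent-frame intensity change at frame i to the change
    between the mean intensities of the windows before and after frame i.

    intensity[k] is the average intensity of frame k. A large ratio suggests
    a momentary flash: the intensity jumps at frame i, but the frames before
    and after it look alike. A ratio near 1 suggests a lasting change.
    """
    if i - window < 0 or i + window >= len(intensity):
        raise IndexError("frame i needs a full window on both sides")
    before = intensity[i - window:i]
    after = intensity[i + 1:i + 1 + window]
    adjacent = abs(intensity[i] - intensity[i - 1])
    surrounding = abs(sum(after) / len(after) - sum(before) / len(before))
    return adjacent / (surrounding + eps)


# A one-frame flash versus a lasting change in brightness.
single_flash = [50, 50, 200, 50, 50]
lasting_change = [50, 50, 200, 200, 200]

ratio_flash = flash_ratio(single_flash, 2)
ratio_change = flash_ratio(lasting_change, 2)
```

Under these assumptions the one-frame flash gives a very large ratio and the lasting change gives a ratio close to 1. An effect that spans several frames also raises the intensity of the window that follows it, which pulls the ratio down toward that of a lasting change and is consistent with the inconsistent results noted above.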
As with many video processing techniques, a key concern with shot boundary estimation is performance. While many known techniques can provide accurate results in many situations, they have correspondingly large computational and memory requirements. Examples of such computationally expensive algorithms are described in the publications entitled “Shot Boundary Detection Using Temporal Statistics Modeling”, authored by Liu et al., IEEE International Conference on Acoustics, Speech, and Signal Processing 2002, pp. 3389-3392 vol. 4, 2002, and “Evaluating and Combining Digital Video Shot Boundary Detection Algorithms”, authored by Browne et al., Proceedings of IMVIP 2000, Belfast, Northern Ireland, September 2000. As will be understood, for use in practical systems it is important for shot boundary estimation to balance accuracy and computational expense in a feasible manner.
It is therefore an object of the invention to provide a novel method and apparatus for estimating shot boundaries in a digital video sequence.