1. Technical Field
The present invention relates generally to digital video processing and analysis and, more specifically, to a system and method for selecting key-frames from video data based on quantifiable measures such as the amount of motion and/or color activity of the video data.
2. Description of Related Art
The use of digital video in many multimedia systems is becoming quite popular. Videos are playing an increasingly important role in both education and commerce. In addition to the currently emerging services such as video-on-demand and pay-television, a variety of new information services such as digital catalogues and interactive multimedia documents, including text, audio and video, are being developed.
Some conventional digital video applications, however, use time-consuming fast forward or rewind methods to search, retrieve and obtain a quick overview of the video content. As such, methods are continuously being developed for accessing the video content, which present the visual information in compact forms such that the operator can quickly browse a video clip, retrieve content in different levels of detail and locate segments of interest. To enable time-efficient access, digital video must be analyzed and processed to provide a structure that allows the user to locate any event in the video and browse it very quickly.
In general, a widely used method to provide the aforementioned needs is to generate a video summary. Conventional video summarization methods typically include segmenting a video into an appropriate set of segments such as video "shots" and selecting one or more key-frames from the shots. A video shot refers to a contiguous recording of one or more video frames depicting a continuous action in time and space. In a shot, the camera could remain fixed, or it may exhibit one of the characteristic motions such as panning, tilting, or tracking. For most videos, shot changes or cuts are created intentionally by video/film directors. Since there are typically many images in a given video shot, it is desirable to reduce the number of such images to one or more key-frames to represent the content of a given shot.
Conventional methods for selecting key-frames may be broadly classified into three categories: uniform sampling, color-based and motion-based methods. Conventional methods based on uniform sampling select the frame(s) at certain instances in the shot as key-frames for the shot. For instance, one common practice is to select only one frame as a key-frame, i.e., the nth frame of the video shot where n is predetermined (typically the first frame, n=1), to represent the content of the video shot. Generally speaking, in a video shot where object and/or camera motion and visual effects are prominent, one representative image is typically not sufficient to represent the entire content of the shot. As such, other conventional key-frame selection methods based on uniform sampling select multiple key-frames within the video shot by selecting those frames that exist at constant time intervals in the shot. This method, however, yields multiple key-frames irrespective of the content of the video shot. When the shot is stationary, all key-frames other than the first will be redundant.
The problem with the uniform sampling methods is that the viewer may be misled about the video content when there is high motion and/or color activity in the shot, due to the well-known problem of uniform sampling, i.e., aliasing.
Conventional key-frame selection methods based on color typically employ histogram or wavelet analysis to find the similarity between consecutive frames in a video shot. For example, one conventional method involves comparing a current frame in the video shot with the previous key-frame in terms of the similarity of their color histograms or luminance projections, starting from the first frame of the shot, which is selected as the first key-frame. The frame with a similarity value smaller than a predetermined threshold is selected as the next key-frame. The selection is terminated when the end of the shot is reached (see "Efficient Matching and Clustering of Video Shots," by B. Liu, et al., Proc. IEEE ICIP, Vol. 1, pp. 338-341, Washington D.C., October, 1995). A similar method uses chromatic features to compute the image similarity (see "A Shot Classification Method Of Selecting Effective Key-Frames For Video Browsing," by H. Aoki, et al., Proc. ACM Multimedia, pp. 1-10, Boston, Mass., November, 1996).
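The conventional histogram-based selection described above can be illustrated with a minimal sketch. It assumes each frame has already been reduced to a normalized color histogram, and it uses histogram intersection as the similarity measure; the cited reference does not necessarily use that measure, and the function names and the threshold value are hypothetical.

```python
def histogram_similarity(h1, h2):
    """Histogram intersection of two normalized histograms (1.0 = identical).

    Assumption: intersection stands in for whatever similarity the prior-art
    method actually uses.
    """
    return sum(min(a, b) for a, b in zip(h1, h2))


def select_keyframes_by_histogram(histograms, threshold):
    """Return key-frame indices for one shot.

    The first frame is always a key-frame. A later frame becomes the next
    key-frame when its similarity to the most recent key-frame falls below
    `threshold`. Selection stops at the end of the shot.
    """
    if not histograms:
        return []
    keyframes = [0]
    for i in range(1, len(histograms)):
        last_key = histograms[keyframes[-1]]
        if histogram_similarity(histograms[i], last_key) < threshold:
            keyframes.append(i)
    return keyframes


# Toy shot: four frames, three color bins each (made-up values).
shot = [
    [0.5, 0.5, 0.0],
    [0.5, 0.4, 0.1],
    [0.1, 0.2, 0.7],
    [0.1, 0.1, 0.8],
]
keyframes = select_keyframes_by_histogram(shot, threshold=0.6)
print(keyframes)
```

The result depends entirely on the chosen threshold, which is the sensitivity that the discussion of these methods points to.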
One problem with histogram-based methods is that they typically fail to select key-frames when the spatial layout of the content changes while the histogram remains constant. As such, other conventional methods use wavelet coefficients or pixel-based frame differences to compute the similarity between frames in order to handle the spatial layout problem.
Another conventional key-frame selection method is disclosed in U.S. patent application Ser. No. 5,635,982 entitled: "System For Automatic Video Segmentation and Key Frame Extraction For Video Sequences Having Both Sharp and Gradual Transitions." With this method, starting from the first frame in the shot, the frame is compared with the rest of the frames in the shot until a significantly different frame is found, and that image is selected as a candidate against which successive frames are compared. The method suggests using one of the following metrics to compute the frame similarity: color, motion or hybrid (color and motion combined). Another conventional method, disclosed in U.S. Pat. No. 5,664,227 entitled: "System and Method For Skimming Digital Audio/video Data," employs a statistical change detection method that uses DCT (discrete cosine transform) coefficients of the compressed images as a similarity measure to select multiple key-frames in a video shot. This method also requires the selection of a threshold.
All the above methods use some form of statistical measure to find the dissimilarity of images and depend heavily on the threshold selection. One problem associated with such an approach is that selecting an appropriate threshold that will work for every kind of video is not trivial, since these thresholds cannot be linked semantically to events in the video but can only be used to compare statistical quantities. Although domain-specific threshold selection is addressed in some of these conventional methods, video generation techniques change over time, and there is a vast number of sources in every domain. In addition, color-based similarity measures cannot quantify the dynamic information due to the camera or object motion in the video shot.
Conventional key-frame selection methods that are based on motion are better suited for controlling the number of frames based on temporal dynamics in the scene. In general, pixel-based image differences or optical flow computation are typically used in motion-based key-frame selection methods. For instance, in one conventional method, a cumulative average image difference curve is sampled non-uniformly to obtain multiple key-frames (see "Visual Search in a Smash System" by R. L. Lagendijk, et al., Proc. IEEE ICIP, pp. 671-674, Lausanne, Switzerland, September 1996). This method, however, requires the pre-selection of the number of key-frames for a video clip. Another method uses optical flow analysis to measure the motion in a shot and select key-frames at the local minima of motion (see "Key Frame Selection By Motion Analysis," by W. Wolfe, Proc. IEEE ICASSP, pp. 1228-1231, Atlanta Ga., May, 1996). Another conventional method involves using the cumulative amount of special motions for selection (see "Scene Change Detection and Content-based Sampling Of Video Sequences," by B. Shahraray, Proc. SPIE Digital Video Compression: Algorithms and Technologies, Vol. 2419, pp. 2-13, San Jose, Calif., February, 1995). Motion-based key-frame selection methods model the temporal changes of the scene with motion only.
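As an illustration of the local-minima approach mentioned above, the following sketch assumes that a scalar motion value per frame is already available (for example, from an optical flow computation that is not shown here). The function name and the sample values are hypothetical, and plateaus of equal values are deliberately not handled.

```python
def local_minima(motion):
    """Return indices of strict interior local minima of a motion curve.

    Assumption: `motion[i]` is a precomputed, non-negative amount of motion
    for frame i. Frames at these minima are the candidate key-frames.
    """
    return [
        i
        for i in range(1, len(motion) - 1)
        if motion[i - 1] > motion[i] < motion[i + 1]
    ]


# Made-up motion magnitudes for a six-frame shot.
motion_curve = [5.0, 3.0, 4.0, 6.0, 2.0, 7.0]
minima = local_minima(motion_curve)
print(minima)
```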
It is believed that a good key-frame selection method is one that can:
1. exploit the dynamic information contained in videos due to camera and/or object motion, and visual effects to pick several key-frames as static image representations of the video to enhance the summary generation,
2. achieve a good balance between preserving as much of the visual content and temporal dynamics in the shot as possible and minimizing the number of key-frames needed for an efficient visual summary, and
3. adjust the performance according to the available computation power.
Accordingly, a key-frame selection process is desired that (1) combines the motion and color based key-frame selection methods and generalizes the solution without the need for statistically defined dissimilarity measures and, thus, highly sensitive threshold selection, and that (2) provides non-uniform sampling to overcome the aforementioned aliasing problems by selecting more frames when the color and motion activity is high, and fewer frames otherwise.
The present invention provides a system and method for selecting key-frames to generate a content-based visual summary of video and facilitate digital video browsing and indexing. This method is aimed at providing a real-time approach for key-frame selection irrespective of the available computation power. A method for selecting key-frames according to one aspect of the present invention is based on quantifiable measures such as the amount of motion and the behavior of curves defined statistically or non-statistically, i.e., by finding the monotonically increasing segments of a curve, instead of thresholding the statistically defined image dissimilarity measures.
In one aspect of the present invention, a method for selecting key-frames from video data comprises the steps of: partitioning video data into segments; generating a temporal activity curve for dissimilarity measures based on one of frame differences, color histograms, camera motion, and a combination thereof, for each segment; and sampling the temporal activity curve to select at least one key-frame for each segment.
In another aspect of the present invention, a temporal activity curve for dissimilarity measures based on frame differences is generated by computing an average of an absolute pixel-based intensity difference between consecutive frames in each segment and, for each segment, computing a cumulative sum of the average of the absolute pixel-based intensity differences for the corresponding frames of the segment. The key-frames in each segment are then selected by selecting the first frame in each motion activity segment of a given segment as a key-frame, if the cumulative sum of the average of the absolute pixel-based intensity differences for the frames of the given segment does not exceed a first predefined threshold; and selecting a predefined number of key-frames in the given segment uniformly, if the cumulative sum of the average of the absolute pixel-based intensity differences for the frames of the given segment exceeds the first predefined threshold.
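The frame-difference aspect described above may be sketched as follows. This is a simplified illustration, not the claimed implementation: it assumes each frame is a flat list of pixel intensities, it treats the whole segment as one unit when the activity is low, and it interprets "uniformly" as evenly spaced frame indices. All names, the threshold and the number of key-frames are hypothetical.

```python
def mean_abs_difference(frame_a, frame_b):
    """Average absolute pixel-based intensity difference of two frames."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)


def cumulative_activity(frames):
    """Cumulative sum of consecutive-frame differences (the activity curve)."""
    curve = [0.0]
    for prev, cur in zip(frames, frames[1:]):
        curve.append(curve[-1] + mean_abs_difference(prev, cur))
    return curve


def select_keyframes(frames, threshold, num_keyframes):
    """Pick the first frame for a quiet segment, else evenly spaced frames.

    Assumptions: `threshold` plays the role of the first predefined
    threshold, and `num_keyframes` (>= 1) is the predefined key-frame count.
    """
    if not frames:
        return []
    total = cumulative_activity(frames)[-1]
    if total <= threshold:
        return [0]
    step = len(frames) / num_keyframes
    return sorted({int(k * step) for k in range(num_keyframes)})


# Two toy segments with two pixels per frame (made-up intensities).
static_segment = [[10, 10], [10, 10], [10, 10], [10, 10]]
active_segment = [[0, 0], [10, 10], [20, 20], [30, 30], [40, 40], [50, 50]]

static_keys = select_keyframes(static_segment, threshold=20, num_keyframes=3)
active_keys = select_keyframes(active_segment, threshold=20, num_keyframes=3)
print(static_keys, active_keys)
```

In this toy input the static segment yields a single key-frame, while the active segment, whose cumulative difference exceeds the threshold, yields three.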
In yet another aspect of the present invention, a temporal activity curve for dissimilarity measures based on camera motion is generated for each segment by estimating camera motion between consecutive frames in a given segment; computing a motion activity curve based on the estimated camera motion for the given segment; and computing a binary motion activity curve by comparing the motion activity curve to a second predefined threshold on a frame-by-frame basis. Preferably, the camera motion is estimated by estimating motion parameters such as zooming, rotation, panning, and tilting. The key-frames of a given segment are then selected by smoothing the binary motion activity curve of the given segment to detect motion activity segments within the given segment; selecting the first and last frame of each motion activity segment as a key-frame; and for each motion activity segment of the given segment, cumulatively summing the estimated camera motion of each frame and selecting at least one additional frame in each motion activity segment as a key-frame if the cumulative sum of the estimated camera motion exceeds a third predefined threshold.
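The camera-motion aspect described above can be sketched under several assumptions. The per-frame camera motion is taken as an already-computed scalar magnitude (the estimation of zoom, rotation, pan and tilt is not shown), smoothing is done with a majority vote over a three-frame window, and the cumulative sum is reset after each additional key-frame. These choices, the function names and the threshold values are illustrative only.

```python
def binarize(curve, threshold):
    """Binary motion activity curve: 1 where motion exceeds the threshold."""
    return [1 if value > threshold else 0 for value in curve]


def smooth_binary(binary, window=3):
    """Majority vote in a sliding window (an assumed smoothing choice)."""
    half = window // 2
    smoothed = []
    for i in range(len(binary)):
        neighborhood = binary[max(0, i - half): i + half + 1]
        smoothed.append(1 if 2 * sum(neighborhood) > len(neighborhood) else 0)
    return smoothed


def find_segments(binary):
    """Return inclusive (start, end) index pairs for each run of 1s."""
    segments, start = [], None
    for i, value in enumerate(binary):
        if value and start is None:
            start = i
        elif not value and start is not None:
            segments.append((start, i - 1))
            start = None
    if start is not None:
        segments.append((start, len(binary) - 1))
    return segments


def select_keyframes(motion, activity_threshold, cumulative_threshold):
    """First and last frame of each motion activity segment, plus extra
    frames wherever the running motion sum exceeds `cumulative_threshold`."""
    smoothed = smooth_binary(binarize(motion, activity_threshold))
    keyframes = set()
    for start, end in find_segments(smoothed):
        keyframes.update((start, end))
        running = 0.0
        for i in range(start, end + 1):
            running += motion[i]
            if running > cumulative_threshold:
                keyframes.add(i)
                running = 0.0  # assumption: restart accumulation
    return sorted(keyframes)


# Made-up motion magnitudes: one sustained pan with a one-frame dip,
# followed by an isolated one-frame spike.
motion = [0.1, 0.2, 3.0, 3.0, 0.1, 3.0, 3.0, 3.0, 0.2, 0.1, 3.0, 0.1]
smoothed = smooth_binary(binarize(motion, 1.0))
segments = find_segments(smoothed)
keyframes = select_keyframes(motion, activity_threshold=1.0,
                             cumulative_threshold=5.0)
print(segments, keyframes)
```

With this input the smoothing fills the one-frame dip and discards the isolated spike, so a single motion activity segment remains.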
In another aspect of the present invention, a temporal activity curve for dissimilarity measures based on color histograms is generated by computing a color histogram of each frame of a given segment; computing a moving average histogram of the given segment using the computed color histograms for each frame; generating a color histogram activity curve by computing a distance between the color histogram of each frame and the moving average histogram; and computing a binary color histogram activity curve by comparing a change in value of the color histogram activity curve between each consecutive frame of the given segment to a fourth predefined threshold value. The key-frames of a given segment are then selected by smoothing the binary color histogram activity curve to detect color histogram activity segments; and selecting at least one representative frame (preferably the first frame) of each color histogram activity segment as a key-frame for the given segment.
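A minimal sketch of the color histogram aspect described above follows. It assumes the per-frame histograms are already computed and normalized, it averages over a trailing window that includes the current frame, and it uses an L1 distance; the source does not fix any of these choices. The smoothing step is omitted here for brevity, and all names and values are hypothetical.

```python
def moving_average_histograms(histograms, window):
    """Trailing moving average over up to `window` histograms per frame."""
    averaged = []
    for i in range(len(histograms)):
        recent = histograms[max(0, i - window + 1): i + 1]
        averaged.append([sum(bins) / len(recent) for bins in zip(*recent)])
    return averaged


def l1_distance(h1, h2):
    """Sum of absolute bin differences (an assumed distance measure)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def activity_curve(histograms, window=3):
    """Distance of each frame's histogram from the moving average histogram."""
    averaged = moving_average_histograms(histograms, window)
    return [l1_distance(h, avg) for h, avg in zip(histograms, averaged)]


def binary_activity(curve, threshold):
    """1 where the curve changes by more than `threshold` between frames."""
    binary = [0]
    for prev, cur in zip(curve, curve[1:]):
        binary.append(1 if abs(cur - prev) > threshold else 0)
    return binary


def first_frames_of_active_runs(binary):
    """Index of the first frame of each run of 1s (the preferred key-frame)."""
    return [i for i, v in enumerate(binary) if v and (i == 0 or not binary[i - 1])]


# Toy segment with two color bins: the color content flips at frame 3.
histograms = [[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3
curve = activity_curve(histograms, window=3)
binary = binary_activity(curve, threshold=0.5)
keyframes = first_frames_of_active_runs(binary)
print(binary, keyframes)
```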
In yet another aspect of the present invention, the key-frame selection process comprises the step of eliminating selected key-frames in each segment that are visually similar. The key-frame elimination process is based on a comparison between histograms of the selected key-frames and/or a comparison of the spatial layout of the selected key-frames.
These and other objects, features and advantages of the present invention will be described or become apparent from the following detailed description of preferred embodiments, which is to be read in connection with the accompanying drawings.