1. FIELD OF THE INVENTION
The present invention is generally related to the field of video image processing. The present invention relates to the field of processing video for the purpose of automatically locating specific content. Specifically, the present invention pertains to the selection of keyframes from a video which are used to represent or summarize the visual content of the video, so that the keyframes may be used by any one of a variety of applications which utilize keyframes for various purposes.
2. DISCUSSION OF THE RELATED ART
When reviewing collections of videos such as recorded meetings or presentations, users are often interested only in an overview of these documents. At FX Palo Alto Laboratory, weekly staff meetings and other seminars and presentations are held in a conference room outfitted with several video cameras. All formal meetings and most presentations are videotaped, MPEG-encoded, and made available to the staff via the company intranet. These videos amount to about three hours per week; more than 150 hours of video exist in the database. It is often difficult to find both the appropriate video file and the portion of the video that is of interest. As video is used more and more as a permanent record for decisions made in meetings and video conferences, it becomes more important to locate the passages containing relevant information or even the meeting in which a decision was made. It is desirable to help users locate specific video passages quickly and provide the users with visual summaries of the videos.
Keyframes are used to distinguish videos from each other, to summarize videos, and to provide access points into them. Well-chosen keyframes help video selection and make the listing more visually appealing. However, it is difficult to determine a single frame that best represents the whole video. It is also difficult to distinguish videos based on a single keyframe, so it is desirable to provide a number of keyframes. As is apparent from the above discussion, a need exists for determining a set of keyframes that describes an entire video clip well.
Most of the related art has been applied to professionally produced material such as movies, TV comedies, and news programs. That art has concentrated on breaking video into shots and then finding keyframes corresponding to those shots. The results of that art are not directly applicable to the applications of the methods of the present invention. First, videotaped meetings and presentations are produced in a more ad hoc fashion, so it is not reasonable to rely on established production practices. Second, using one or more keyframes from each shot produces more keyframes than are needed for many applications.
Many of the conventional systems described in the literature use a constant number of keyframes for each detected shot. Some use the first frame of each shot as a keyframe. Others represent shots with two keyframes: the first and last frames of each shot. Others use clustering on the frames within each shot. The frame closest to the center of the largest cluster is selected as the keyframe for that shot. Some generate a composite image to represent shots with camera motion.
Other conventional systems use more keyframes to represent shots that have more interesting visual content. Some segment the video into shots and select the first clean frame of each shot as a keyframe. Other frames in the shot that are sufficiently different from the last keyframe are marked as keyframes as well.
One way to reduce the number of keyframes is to remove redundancies. One conventional approach selects one keyframe for each video shot. These keyframes are then clustered based on visual similarity and temporal distance. Since their purpose is to group shots to determine video structure, the temporal constraints are used to prevent keyframes that occur far apart in time from being grouped together.
A conventional system divides a video into intervals of equal length and determines the intervals with the largest dissimilarity between the first and last frame. All frames from those intervals are kept, whereas only two frames from each of the remaining intervals are kept. The process is repeated until the desired number of frames or fewer are left. This approach takes only fairly localized similarities into consideration and cannot apply constraints for frame distribution or minimum distance.
A conventional system also provides an alternate representation of a video sequence that uses keyframes that are evenly spaced, ignoring shot boundaries.
The conventional systems do not meet the goal of extracting an exact number of representative keyframes. Existing systems either provide only limited control over the number of keyframes or do not perform an adequate job of finding truly representative frames. In addition, other systems do not apply temporal constraints for keyframe distribution and spacing.
Conventional evenly spaced keyframes do not provide sufficient coverage of video content. Thus, as is apparent from the above discussion, a need exists for a keyframe selection method which provides sufficient coverage of video content.
In accessing large collections of digitized videos, it is conventionally difficult to find both the appropriate video file and the portion of the video that is of interest. Keyframes are used in many different applications to provide access to video. However, most conventional algorithms do not consider time. Also, most conventional keyframe selection approaches first segment a video into shots before selecting one or several keyframes for every shot. According to the present invention, time constraints are placed on the selected video frames because they align keyframes spatially to a timescale. According to the present invention, selecting candidate frames does not require any explicit prior shot segmentation. Instead, a number of candidate boundaries much larger than the actual number of shot boundaries is determined and the frames before and after those boundaries are selected. The methods of the present invention gracefully deal with significant changes during a shot without missing important keyframes. While most conventional keyframe selection algorithms select at least one keyframe per shot, the method according to the present invention selects far fewer keyframes than the number of shots by returning exactly the requested number of keyframes. The method according to the present invention selects keyframes from the candidate frames using a hierarchical clustering method.
A method for selecting keyframes based on image similarity produces a variable number of keyframes that meet various temporal constraints. A hierarchical clustering approach determines exactly as many clusters as requested keyframes. Temporal constraints determine which representative frame from each cluster is chosen as a keyframe. The detection of features such as slide images and close-ups of people is used to modify the clustering of frames to emphasize keyframes with desirable features.
The present invention includes a method for determining keyframes that are different from each other and provide a good representation of the whole video. Keyframes are used to distinguish videos from each other, to summarize videos, and to provide access points into them. The methods of the present invention determine any number of keyframes by clustering the frames in a video and by selecting a representative frame from each cluster. Temporal constraints are used to filter out some clusters and to determine the representative frame for a cluster. Desirable visual features are emphasized in the set of keyframes. An application for browsing a collection of videos makes use of the keyframes to support skimming and to provide visual summaries.
According to an aspect of the present invention, a method for candidate frame selection involves sampling the source frames of the source video at a predetermined fixed periodic interval. Preferably, the fixed periodic interval is a function of the type of the video, and preferably ranges from about 0.2 to 0.5 seconds. A frame difference is computed for each sampled frame which indicates the difference between the sampled frame and the previous sampled frame. The largest frame differences represent candidate boundaries, and the frames before and after the N/2 largest candidate boundaries are selected as candidate frames in order to achieve up to N candidate frames. Optionally, the distance measure is modified according to the class membership of the frames. The class membership of the frames is optionally computed statistically from image class statistical models.
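The candidate frame selection described above can be illustrated with a minimal Python sketch. It is not the claimed implementation: the function name `select_candidate_frames`, its parameters, and the use of a caller-supplied `distance` function are assumptions made for illustration, and the frames are assumed to have already been sampled at the fixed interval.

```python
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")


def select_candidate_frames(
    frames: Sequence[Frame],
    n_candidates: int,
    distance: Callable[[Frame, Frame], float],
) -> List[int]:
    """Return sorted indices of up to n_candidates candidate frames.

    `frames` holds the already-sampled frames; `distance` is a
    hypothetical frame-difference measure supplied by the caller.
    """
    # Difference between each sampled frame and the previous sampled frame.
    diffs = [
        (distance(frames[i - 1], frames[i]), i) for i in range(1, len(frames))
    ]
    # The N/2 largest differences are treated as candidate boundaries.
    boundaries = sorted(diffs, reverse=True)[: n_candidates // 2]
    # Keep the frame before and the frame after each boundary. Adjacent
    # boundaries share a frame, so fewer than N frames may result.
    selected = set()
    for _, i in boundaries:
        selected.add(i - 1)
        selected.add(i)
    return sorted(selected)


# Toy example: scalar "frames" compared by absolute difference.
samples = [0, 0, 0, 10, 10, 10, 3, 3]
print(select_candidate_frames(samples, 4, lambda a, b: abs(a - b)))
# prints [2, 3, 5, 6]
```

In this toy run the two largest differences fall between indices 2 and 3 and between indices 5 and 6, so the frames on either side of those two boundaries become the candidates.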
According to yet another aspect of the present invention, a method for selecting keyframes involves clustering all candidate frames into a hierarchical binary tree using a hierarchical agglomerative clustering algorithm. Initially, all frames are deemed single-frame clusters. The two clusters having the lowest maximum pairwise distance between any two frames (one frame from each of the two clusters) become the two constituent clusters of a larger cluster. The clustering continues until a single root cluster contains all the candidate frames. Optionally, the pairwise distances for the members of the two clusters are modified according to class membership of the members, which is preferably determined statistically from image class statistical models.
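The merge rule described above, in which the two clusters with the lowest maximum pairwise distance are joined, corresponds to what is commonly called complete-linkage agglomerative clustering. The following Python sketch illustrates that rule under stated assumptions: the `Cluster` class, the function `build_cluster_tree`, and the `dist` callback over frame indices are hypothetical names, and the optional class-membership modification of the distances is not shown.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Cluster:
    """Node of the binary cluster tree (hypothetical structure)."""
    members: List[int]                 # indices of candidate frames
    left: Optional["Cluster"] = None   # constituent clusters, None for a leaf
    right: Optional["Cluster"] = None


def build_cluster_tree(
    n_frames: int, dist: Callable[[int, int], float]
) -> Cluster:
    """Merge single-frame clusters until one root cluster remains."""
    if n_frames < 1:
        raise ValueError("at least one candidate frame is required")
    clusters = [Cluster([i]) for i in range(n_frames)]

    def linkage(a: Cluster, b: Cluster) -> float:
        # Maximum pairwise distance, one frame taken from each cluster.
        return max(dist(i, j) for i in a.members for j in b.members)

    while len(clusters) > 1:
        # Find the pair of clusters with the lowest maximum pairwise distance.
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        merged = Cluster(
            sorted(clusters[x].members + clusters[y].members),
            clusters[x],
            clusters[y],
        )
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return clusters[0]
```

This naive version recomputes every linkage on each pass and is meant only to make the merge rule concrete; a practical implementation would cache the pairwise distances.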
According to yet another aspect of the present invention, a method for selecting M clusters from which keyframes are extracted involves splitting the M-1 largest clusters of a hierarchical binary tree of clusters. The size of a cluster is determined by the number of frames within all the sub-clusters contained in the cluster. Optionally, clusters not having at least one uninterrupted sequence of frames of at least a minimum duration are filtered out. Clusters representing single frames are preferably filtered out because they are likely to represent video artifacts such as distortions.
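The selection of M clusters by repeated splitting can be sketched as follows. This is an illustrative Python sketch only: the tree is assumed to be encoded as nested two-element tuples whose leaves are integer frame indices, the names `leaves` and `select_clusters` are hypothetical, and the optional filtering of single-frame or short-duration clusters is omitted.

```python
from typing import List, Tuple, Union

# A tree node is either a frame index (leaf) or a pair of sub-trees.
Node = Union[int, Tuple["Node", "Node"]]


def leaves(node: Node) -> List[int]:
    """Return all frame indices contained in a cluster."""
    if isinstance(node, int):
        return [node]
    return leaves(node[0]) + leaves(node[1])


def select_clusters(root: Node, m: int) -> List[List[int]]:
    """Split the largest cluster M-1 times to obtain up to M clusters."""
    clusters: List[Node] = [root]
    while len(clusters) < m:
        # Only clusters with two constituent clusters can be split.
        splittable = [
            k for k, c in enumerate(clusters) if not isinstance(c, int)
        ]
        if not splittable:
            break  # fewer than M frames are available
        # Cluster size is the number of frames in all of its sub-clusters.
        k = max(splittable, key=lambda idx: len(leaves(clusters[idx])))
        left, right = clusters.pop(k)
        clusters += [left, right]
    return [sorted(leaves(c)) for c in clusters]


# Toy tree over six frames; request three clusters.
tree = ((((0, 1), 2), (3, 4)), 5)
print(sorted(select_clusters(tree, 3)))
# prints [[0, 1, 2], [3, 4], [5]]
```

In the toy run, the single-frame cluster `[5]` is the kind of cluster that the filtering step described above would preferably discard.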
According to still another aspect of the present invention, a method for selecting keyframes applies temporal constraints in order to attempt to guarantee keyframe inclusion in all portions of a video, and to guarantee at least a minimum separation between keyframes. The source video duration is divided into equal duration intervals. If an interval has no keyframes, all other intervals having at least two keyframes are inspected in descending keyframe count order to attempt to find a keyframe within a cluster that has a member within the interval that does not have any keyframes. If such a keyframe is found, the member is substituted as the keyframe for the cluster, thereby spreading out the keyframe distribution. In order to guarantee minimum keyframe separation, the minimum time between any two keyframes is determined. If this minimum time is less than a minimum time threshold, an attempt is made to find another keyframe from one or both of the two clusters to which the two conflicting keyframes belong. If a substitute cannot be found, one of the conflicting keyframes is deleted.
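The first of the two temporal constraints, spreading keyframes across equal-duration intervals, can be illustrated with the Python sketch below. It is a simplified reading of the description, not the claimed method: the function name `spread_keyframes`, the representation of clusters as lists of member times in seconds, and the choice of the first matching member as the substitute are all assumptions. The minimum-separation step is not covered.

```python
from collections import Counter
from typing import List, Sequence


def spread_keyframes(
    clusters: Sequence[Sequence[float]],
    keyframes: Sequence[float],
    duration: float,
    n_intervals: int,
) -> List[float]:
    """Move keyframes from crowded intervals into empty ones where possible.

    clusters[k] holds the member times of cluster k and keyframes[k] is
    the time currently chosen as that cluster's keyframe.
    """
    result = list(keyframes)
    width = duration / n_intervals

    def interval(t: float) -> int:
        # Clamp so that t == duration falls into the last interval.
        return min(int(t // width), n_intervals - 1)

    for empty in range(n_intervals):
        counts = Counter(interval(t) for t in result)
        if counts[empty] > 0:
            continue
        # Intervals with at least two keyframes, most crowded first.
        crowded = sorted(
            (i for i, c in counts.items() if c >= 2),
            key=lambda i: -counts[i],
        )
        moved = False
        for src in crowded:
            for k, t in enumerate(result):
                if interval(t) != src:
                    continue
                # Does this keyframe's cluster have a member in the gap?
                subs = [m for m in clusters[k] if interval(m) == empty]
                if subs:
                    result[k] = subs[0]  # substitute that member
                    moved = True
                    break
            if moved:
                break
    return result
```

For example, with a 40-second video split into four intervals, clusters `[[1, 2, 25], [3, 4], [12, 13], [35]]` and initial keyframes `[1, 3, 12, 35]`, the third interval starts empty while the first holds two keyframes; the sketch replaces the keyframe at 1 second with the member of the same cluster at 25 seconds.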
Thus, a variable number of distinct keyframes that provide a good representation of all the frames in the video is determined according to the present invention. According to the present invention, hierarchical clustering is performed and single frames are selected from each cluster. In an alternative, if more or fewer keyframes are desired by the user or application, the number of clusters is simply increased or decreased according to the present invention. According to the present invention, temporal constraints are used to filter out unsuitable clusters and to select a representative frame for each cluster. The present invention uses temporal constraints to prevent keyframes from appearing too close together in time.
An application using the keyframe extraction mechanism allows users to access a collection of video-taped staff meetings and presentations. The keyframe skimming interface greatly simplifies the task of finding the appropriate video and getting an overview of it. These and other aspects, features, and advantages of the present invention will be apparent from the Figures, which are fully described in the Detailed Description of the Invention.