Because of the huge number of frames in a typical video sequence, it is necessary in many applications to select a small number of characteristic frames to represent the larger sequence. Such frames are known in the art as representative frames, or r-frames. They are used, for example, in multimedia indexing and retrieval systems (MIRS) and in video archives, in order to facilitate efficient search and recall of video information. An overview of these applications is provided by Lu in Multimedia Database Management Systems (Artech House, 1999), which is incorporated herein by reference. A typical method for indexing a video database in this manner is described in U.S. Pat. No. 5,485,611, which is likewise incorporated herein by reference. R-frames can also be used for video compression at low bit rates, by encoding only a representative subset of the original video sequence.
In order for a video processing computer to choose the proper r-frames in a sequence, it is generally necessary first for the computer to divide the sequence into segments. Most of the work that has been done on automatic video sequence segmentation has focused on identifying shots. A shot is a group of sequential frames depicting continuous action in time and space. Methods for detecting shot transitions are described, for example, by Sethi et al., in “A Statistical Approach to Scene Change Detection,” published in Proceedings of the Conference on Storage and Retrieval for Image and Video Databases III (SPIE Proceedings 2420, San Jose, Calif., 1995), pages 329-338, which is incorporated herein by reference. Further methods for finding shot transitions and identifying r-frames within a shot are described in U.S. Pat. Nos. 5,245,436, 5,606,655, 5,751,378, 5,767,923 and 5,778,108, which are also incorporated herein by reference.
When a shot is taken with a stationary camera and contains little action, a single r-frame will generally represent the shot adequately. When the camera is moving, however, there may be large differences in content between different frames in a single shot. Therefore, a better representation of the video sequence can be achieved by grouping frames into smaller segments that have similar content. An approach of this sort was adopted, for example, in U.S. Pat. No. 5,635,982, which is incorporated herein by reference. This patent describes an automatic video content parser, used to perform video segmentation and key frame (i.e., r-frame) extraction for video sequences having both sharp and gradual transitions. The system analyzes the temporal variation of video content and selects a new key frame whenever the difference in content between the current frame and the preceding key frame exceeds a set of preselected thresholds. In other words, each segment found by the system consists of an r-frame, which is the first frame in the segment, followed by a group of subsequent frames that do not differ from the r-frame by more than the thresholds.
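By way of illustration only, the threshold-based selection described above may be sketched as follows. The sketch is not taken from U.S. Pat. No. 5,635,982. It assumes that each frame has already been reduced to a feature vector such as a normalized color histogram, that the content difference is measured as a sum of absolute differences, and that a single threshold is used in place of the set of preselected thresholds. The names l1_distance and select_key_frames are hypothetical.

```python
from typing import List, Sequence


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of absolute differences between two feature vectors (assumed metric)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def select_key_frames(features: Sequence[Sequence[float]], threshold: float) -> List[int]:
    """Return indices of key frames chosen by a single-threshold rule.

    The first frame is always a key frame. A later frame becomes a key frame
    when its distance from the most recent key frame exceeds `threshold`.
    Each key frame therefore starts a new segment.
    """
    if not features:
        return []
    key_indices = [0]
    for i in range(1, len(features)):
        last_key = features[key_indices[-1]]
        if l1_distance(features[i], last_key) > threshold:
            key_indices.append(i)
    return key_indices


# Hypothetical two-bin histograms for five frames of a slow pan.
frames = [[1.0, 0.0], [0.9, 0.1], [0.5, 0.5], [0.45, 0.55], [0.0, 1.0]]
print(select_key_frames(frames, threshold=0.5))
```

In this sketch the r-frame is always the first frame of its segment, so the frames that follow it are compared only against that r-frame and not against one another.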
Another approach to r-frame selection is described by Zhuang et al., in “Adaptive Key Frame Extraction Using Unsupervised Clustering,” in Proceedings of the IEEE International Conference on Image Processing (Chicago, October, 1998), pages 866-870, which is incorporated herein by reference. The authors divide each shot in a video sequence into one or more clusters of frames that are similar in visual content, but are not necessarily sequential. For example, the frames may be clustered according to characteristics of their color histograms, with frames from both the beginning and the end of a shot being grouped together in a single cluster. A centroid of the clustering characteristic is computed for each cluster, and the frame that is closest to the centroid is chosen to be the key frame for the cluster.
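The cluster-based selection described above may likewise be sketched, for illustration only, as follows. The sketch is not presented as the procedure of Zhuang et al. It assumes a simple rule in which each frame joins the nearest existing cluster if its distance from that cluster's centroid does not exceed a threshold, and otherwise starts a new cluster. The distance measure, the threshold, and the names l1_distance and cluster_key_frames are hypothetical.

```python
from typing import List, Sequence, Tuple


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of absolute differences between two feature vectors (assumed metric)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cluster_key_frames(
    features: Sequence[Sequence[float]], threshold: float
) -> Tuple[List[List[int]], List[int]]:
    """Group the frames of one shot into clusters and pick one key frame per cluster.

    Returns (clusters, key_frames): clusters[k] lists the frame indices in
    cluster k, and key_frames[k] is the member closest to that cluster's centroid.
    """
    clusters: List[List[int]] = []
    centroids: List[List[float]] = []
    for i, f in enumerate(features):
        # Find the nearest existing centroid, if any.
        best, best_d = None, None
        for c, cen in enumerate(centroids):
            d = l1_distance(f, cen)
            if best_d is None or d < best_d:
                best, best_d = c, d
        if best is not None and best_d <= threshold:
            clusters[best].append(i)
            n = len(clusters[best])
            # Incremental update of the centroid (running mean of the members).
            centroids[best] = [m + (x - m) / n for m, x in zip(centroids[best], f)]
        else:
            clusters.append([i])
            centroids.append(list(f))
    key_frames = [
        min(members, key=lambda i: l1_distance(features[i], cen))
        for members, cen in zip(clusters, centroids)
    ]
    return clusters, key_frames


# Hypothetical two-bin histograms: the shot alternates between two views.
frames = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.2], [0.1, 0.9], [0.9, 0.1], [0.2, 0.8]]
print(cluster_key_frames(frames, threshold=0.5))
```

Because cluster membership here depends only on similarity of the feature vectors, frames that are not sequential in the shot can fall into the same cluster, and the key frame of a cluster need not be its first frame.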