Video programs are one form of multimedia data, that is, data composed of at least two distinct media components. For example, a video program is composed of a full-motion video component and an audio component. A number of methods are known for reducing the large storage and transmission requirements of the video component of video programs. For example, certain compression methods (such as JPEG) take advantage of spatial redundancies that exist within an individual video frame to reduce the number of bytes required to represent the frame. Additional compression may be achieved by taking advantage of the temporal redundancy that exists between consecutive frames, which is the basis for known compression methods such as MPEG. These known compression methods generate a fixed number of frames per unit time to preserve the motion information contained in the video program.
In contrast to the compression methods mentioned above, other methods compress video programs by selecting certain frames from the entire sequence of frames to serve as representative frames. For example, a single frame may be used to represent the visual information contained in any given scene of the video program. A scene may be defined as a segment of the video program over which the visual contents do not change significantly. Thus, a frame selected from the scene may be used to represent the entire scene without a substantial loss of information. A series of such representative frames from all the scenes in the video program provides a reasonably accurate representation of the entire video program with an acceptable degree of information loss. These compression methods in effect perform a content-based sampling of the video program. Unlike the temporal or spatial compression methods discussed above, in which the frames are uniformly spaced in time, a content-based sampling method performs a temporally non-uniform sampling of the video program to generate a set of representative frames. For example, a single representative frame may represent a long segment of the video program (e.g., a long scene in which a person makes a speech without substantially changing position for an extended period) or a very short segment of the video program (e.g., a scene displayed in the video program for only a few seconds).
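The temporally non-uniform character of content-based sampling can be illustrated with a minimal sketch. It is not the method of any particular prior-art system. It assumes, purely for illustration, that each frame is a flat sequence of pixel intensities and that a new scene begins whenever the mean absolute difference from the current representative frame exceeds a fixed threshold. The names `Frame`, `mean_abs_diff`, `select_representative_frames`, and `threshold` are hypothetical.

```python
from typing import List, Sequence

# Assumption: a frame is modeled as a flat sequence of pixel intensities.
Frame = Sequence[float]


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Mean absolute pixel difference between two frames of equal size."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("frames must be non-empty and of equal size")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def select_representative_frames(frames: Sequence[Frame],
                                 threshold: float) -> List[int]:
    """Return indices of representative frames, one per detected scene.

    Illustrative rule (an assumption, not taken from the text): a frame
    starts a new scene when it differs from the current representative
    frame by more than `threshold`.
    """
    if not frames:
        return []
    selected = [0]          # the first frame represents the first scene
    reference = frames[0]
    for index in range(1, len(frames)):
        if mean_abs_diff(frames[index], reference) > threshold:
            selected.append(index)
            reference = frames[index]
    return selected


if __name__ == "__main__":
    # A long static scene, a very short scene, then a medium-length scene.
    program = [[0.0] * 4] * 5 + [[100.0] * 4] * 2 + [[10.0] * 4] * 3
    print(select_representative_frames(program, threshold=20.0))
```

In this sketch the selected indices are unevenly spaced in time: a scene of any length contributes exactly one representative frame, whereas a uniform sampler would contribute frames in proportion to duration.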
Methods for automatically generating representative images from video programs are known. These methods may detect the boundaries between consecutive shots and may additionally detect scene changes that occur within the individual shots. An example of a method for locating abrupt and gradual transitions between shots is disclosed in patent application Ser. No. 08/171,136, filed Dec. 21, 1993, and entitled "Method and Apparatus for Detecting Abrupt and Gradual Scene Changes In Image Sequences," the contents of which are hereby incorporated by reference. A method for detecting scene changes that occur within individual shots has been disclosed in patent application Ser. No. 08/191,234, filed Feb. 4, 1994, entitled "Camera-Motion Induced Scene Change Detection Method and System," the contents of which are also hereby incorporated by reference.
Content-based sampling methods are typically employed for indexing purposes because the representative frames generated by such methods can efficiently convey the visual information contained in a video program. However, these methods fail to convey all the useful information contained in a multimedia format such as video because they compress only one media component (in the case of a video program, the video component) and exclude the remaining media component or components (e.g., audio).