Digital video is a rapidly growing element of the computer and telecommunication industries. Many companies, universities and even families already have large repositories of videos in both analog and digital formats. Examples include video used in broadcast news, training and education videos, security monitoring videos, and home videos. The fast evolution of digital video is changing the way many people capture and interact with multimedia, and in the process, it has brought about many new needs and applications.
One such application is video abstraction. A video abstract, as the name implies, is a short summary of a longer video sequence that gives users concise information about the content of the sequence while preserving the essential message of the original. In principle, a video abstract can be generated manually or automatically. However, due to the huge volumes of video data already in existence and the ever-increasing amount of new video data being created, it is increasingly difficult to generate video abstracts manually. Thus, it is becoming more and more important to develop fully automated video analysis and processing tools that reduce human involvement in the video abstraction process.
There are two fundamentally different kinds of video abstracts: still-image abstracts and moving-image abstracts. The still-image abstract, also called a video summary, is a small collection of salient images (known as keyframes) extracted or generated from the underlying video source. The moving-image abstract, also called video skimming, consists of a collection of image sequences together with the corresponding audio abstract extracted from the original sequence, and is thus itself a video clip, but of considerably shorter length. Generally, a video summary can be built much faster than a video skim, since only visual information is used and no audio or textual information needs to be handled. A video summary is also easier to display, since there are no timing or synchronization issues. Furthermore, the extracted representative frames can be laid out spatially in their temporal order so that users are able to grasp the video content more quickly. Finally, when needed, all extracted still images in a video summary can easily be printed.
While video summarization is applicable to video sequences in any storage medium (tape, disc, etc.), one common storage medium of interest is the DVD video disc. DVD video is dramatically changing the way people utilize multimedia information. The huge storage capacity of a DVD video disc makes it an ideal storage place for still images, text, video and audio. The navigation features supported by the DVD video format enable interactive access to media content. To accommodate the various media types that can be stored on a DVD disc, there is an increasing need for a technology that can organize the media according to the DVD video format specifications and export the organized media content to the DVD disc. This technology is generally called “DVD authoring”, and one essential task of DVD authoring is to create the DVD video title and navigation structure from the video source.
The DVD video title structure consists primarily of two entities, titles and chapters, which are used to organize the video content for interactive browsing. The format of a DVD disc allows the DVD disc to contain up to 99 titles, and a title may contain up to 99 chapters. The titles and chapters thus segment the entire video sequence into meaningful pieces with each title and/or chapter being an entry point for one particular piece of video.
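As an illustration only, the title-and-chapter hierarchy and its limits can be modeled with a small data structure. This is a minimal sketch, not part of any DVD authoring tool; the class names `Disc`, `Title` and `Chapter`, and the use of a start time in seconds as the chapter entry point, are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List

MAX_TITLES = 99    # per-disc limit stated in the text
MAX_CHAPTERS = 99  # per-title limit stated in the text


@dataclass
class Chapter:
    # Hypothetical representation: the entry point is a start time in seconds.
    start_seconds: float


@dataclass
class Title:
    chapters: List[Chapter] = field(default_factory=list)

    def add_chapter(self, start_seconds: float) -> Chapter:
        # Refuse to exceed the per-title chapter limit.
        if len(self.chapters) >= MAX_CHAPTERS:
            raise ValueError("a title may contain at most 99 chapters")
        chapter = Chapter(start_seconds)
        self.chapters.append(chapter)
        return chapter


@dataclass
class Disc:
    titles: List[Title] = field(default_factory=list)

    def add_title(self) -> Title:
        # Refuse to exceed the per-disc title limit.
        if len(self.titles) >= MAX_TITLES:
            raise ValueError("a disc may contain at most 99 titles")
        title = Title()
        self.titles.append(title)
        return title
```

Each `Chapter` here serves as an entry point into one piece of the video, and the two limits are enforced at insertion time so that an over-full structure can never be built.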
The ability to automatically create the title-and-chapter structure from a video sequence is of great interest in DVD authoring. For example, in Hewlett Packard's MyDVD application, when a user elects to have a DVD created automatically from a video, a new chapter is created each time a scene detection algorithm detects a scene. A keyframe is then extracted from each detected scene. The keyframe, which represents the underlying scene, is linked to a DVD navigation button so that the user can browse the keyframes to quickly grasp the content of the video sequence and click the relevant button to watch the corresponding scene.
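The scene-to-chapter mapping described above can be sketched as follows. This is a hypothetical illustration, not the MyDVD implementation: the scene boundaries are assumed to come from some scene detection algorithm that is not shown, the names `Scene`, `ChapterEntry` and `chapters_from_scenes` are invented for the example, and choosing the middle frame as the keyframe is an assumption, since the text does not say how the keyframe is selected.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Scene:
    start_frame: int
    end_frame: int  # inclusive


@dataclass
class ChapterEntry:
    chapter_number: int  # 1-based, in temporal order
    start_frame: int     # entry point the navigation button jumps to
    keyframe: int        # frame shown on the navigation button


def chapters_from_scenes(scenes: List[Scene]) -> List[ChapterEntry]:
    """Create one chapter per detected scene, each with a representative keyframe."""
    entries = []
    for number, scene in enumerate(scenes, start=1):
        # Assumption: use the middle frame of the scene as its keyframe.
        keyframe = (scene.start_frame + scene.end_frame) // 2
        entries.append(ChapterEntry(number, scene.start_frame, keyframe))
    return entries
```

For two scenes covering frames 0 to 99 and 100 to 249, this yields chapter 1 starting at frame 0 with keyframe 49, and chapter 2 starting at frame 100 with keyframe 174.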
Occasionally, the number of detected scenes may be larger than the number of chapters that is preferred or allowed. Therefore, a method for intelligently merging the detected scenes is needed. Further, if a chapter contains several original scenes which have been merged, a method for constructing a meaningful and informative keyframe to represent the underlying merged video content is needed.
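To make the problem concrete, the sketch below shows a deliberately naive baseline for reducing the number of scenes to a chapter limit. It is not the intelligent merging method called for above: it looks only at scene lengths, ignores visual content entirely, and does not address keyframe construction for the merged scenes. The function name `merge_to_limit` and the representation of a scene as a `(start_frame, end_frame)` tuple are assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical representation: (start_frame, end_frame), inclusive, in temporal order.
Scene = Tuple[int, int]


def merge_to_limit(scenes: List[Scene], max_chapters: int) -> List[Scene]:
    """Naively merge adjacent scenes until at most max_chapters remain."""
    if max_chapters < 1:
        raise ValueError("max_chapters must be at least 1")
    merged = list(scenes)
    while len(merged) > max_chapters:
        # Pick the adjacent pair whose combined span is shortest
        # (the first such pair if there is a tie).
        i = min(range(len(merged) - 1),
                key=lambda k: merged[k + 1][1] - merged[k][0])
        # Replace the pair with a single scene covering both.
        merged[i:i + 2] = [(merged[i][0], merged[i + 1][1])]
    return merged
```

With scenes `[(0, 9), (10, 19), (20, 99), (100, 109)]` and a limit of 2, this baseline returns `[(0, 19), (20, 109)]`. Because it can join scenes with unrelated content simply because they are short, it illustrates why a content-aware merging method is needed.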