Digital video is a rapidly growing element of the computer and telecommunication industries. Many companies, universities and even families already have large repositories of videos in both analog and digital formats. Examples include video used in broadcast news, training and education videos, security monitoring videos, and home videos. The fast evolution of digital video is changing the way many people capture and interact with multimedia, and in the process, it has brought about many new needs and applications.
One such application is video abstraction. A video abstract, as the name implies, is a short summary of a longer video sequence that gives users concise information about the content of that sequence while preserving the essential message of the original. In principle, a video abstract can be generated manually or automatically. However, given the huge volume of video data already in existence and the ever-increasing amount of new video data being created, generating video abstracts manually is increasingly difficult. It is therefore becoming more and more important to develop fully automated video analysis and processing tools that reduce human involvement in the video abstraction process.
There are two fundamentally different kinds of video abstracts: still-image abstracts and moving-image abstracts. The still-image abstract, also called a video summary, is a small collection of salient images (known as keyframes) extracted or generated from the underlying video source. The moving-image abstract, also called video skimming, consists of a collection of image sequences, together with the corresponding audio abstract extracted from the original sequence; it is thus itself a video clip, but considerably shorter than the video sequence from which it is derived. Generally, a still-image abstract is easier and faster to create than a moving-image abstract, since only visual information is used to generate the still-image abstract, whereas a moving-image abstract also requires audio or textual information to be incorporated and synchronized. Furthermore, the extracted representative frames can be laid out spatially in their temporal order, so that users can grasp the video content more quickly from the still-image abstract. Finally, when needed, the extracted still images can easily be printed.
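The spatial display of temporally ordered keyframes can be illustrated with a minimal sketch. The function name `storyboard_positions`, the grid layout, and the sample timestamps are assumptions made purely for illustration; the text above does not prescribe any particular layout.

```python
from typing import List, Tuple


def storyboard_positions(
    timestamps: List[float], columns: int
) -> List[Tuple[float, int, int]]:
    """Map keyframe timestamps to (timestamp, row, column) grid cells.

    Keyframes are sorted by time and placed left to right, top to bottom,
    so that reading order on the page matches temporal order in the video.
    """
    if columns < 1:
        raise ValueError("columns must be at least 1")
    ordered = sorted(timestamps)
    return [(t, i // columns, i % columns) for i, t in enumerate(ordered)]


if __name__ == "__main__":
    # Hypothetical keyframe times in seconds, deliberately out of order.
    sample = [42.0, 3.5, 17.0, 88.2, 61.0]
    for t, row, col in storyboard_positions(sample, columns=3):
        print(f"keyframe at {t:5.1f}s -> row {row}, column {col}")
```

A row-major grid is only one possible arrangement; a single horizontal strip corresponds to setting `columns` to the number of keyframes.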
While video summarization is applicable to video sequences in any storage medium (tape, disc, etc.), one common storage medium of interest is the DVD video disc. DVD video is dramatically changing the way people use multimedia information. The huge storage capacity of a DVD video disc makes it an ideal storage place for still images, text, video and audio. The navigation features supported by the DVD video format enable interactive access to media content. To accommodate the various media types that can be stored on a DVD disc, there is an increasing need for a technology that can organize the media according to the DVD video format specifications and export the organized media content to the DVD disc. This technology is generally called “DVD authoring”, and one essential task of DVD authoring is to create the DVD video title and navigation structure from the video source.
As explained in greater detail below, FIGS. 1a and 1b are illustrations of a video summary 10 of a video source as it may appear on a video display device 1, such as a computer monitor or television. The video source may be, for example, a DVD, a video stored on a hard drive, a video tape, or any other video storage medium. The structure of the video summary 10 consists primarily of two entities: titles 12 and chapters 14. Titles 12 and chapters 14 are used to organize the video content of the video source for interactive browsing. They segment the entire video sequence of the video source into meaningful pieces, with each title 12 and chapter 14 being an entry point for a particular piece of video.
In the example illustrated in FIGS. 1a and 1b, the video summary 10 may be used for browsing the entire content of the video source and quickly locating a desired section of the video. For example, when browsing the video, the user first sees the display of FIG. 1a, showing that the video is segmented into smaller sections (i.e., titles 12) labeled “Trip A”, “Trip B” and “Trip C”. The video content of each title 12 is determined by the particular video summarization technique used. To browse the content of the video in greater detail, the user selects a title 12 of interest (in this example, the title 12 labeled “Trip B”). The chapters 14 that make up the content of the selected title 12 are then displayed, as in FIG. 1b, where the title 12 labeled “Trip B” is shown to include chapters 14 labeled “Location A” and “Location B”. The user may select a chapter 14 to view the underlying video sequence.
As shown in FIGS. 1a and 1b, each graphical representation of a title 12 or chapter 14 typically comprises two parts: a representative image or keyframe 16 and a text label 18. Ideally, the keyframe 16 and text label 18 capture the underlying video content and convey to the user what is in the underlying video document. It is therefore desirable that the keyframe 16 and text label 18 capture the semantic focus of the video content. Achieving this remains a challenging issue in the research areas of video analysis and video summarization.
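The title-and-chapter structure of FIGS. 1a and 1b can be modeled with a small data structure, sketched below. The class names, field names, image paths and start times are all hypothetical and chosen only for illustration. Only the labels “Trip A”, “Trip B”, “Trip C”, “Location A” and “Location B” come from the figures as described, and the chapters of “Trip A” and “Trip C” are left empty because they are not shown.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Chapter:
    label: str            # text label 18
    keyframe: str         # keyframe 16, here a path to an image (assumption)
    start_seconds: float  # entry point into the video source (assumption)


@dataclass
class Title:
    label: str
    keyframe: str
    chapters: List[Chapter] = field(default_factory=list)


def build_example_summary() -> List[Title]:
    """Build a video summary 10 mirroring the labels of FIGS. 1a and 1b."""
    return [
        Title("Trip A", "trip_a.png"),
        Title(
            "Trip B",
            "trip_b.png",
            [
                Chapter("Location A", "trip_b_loc_a.png", 600.0),
                Chapter("Location B", "trip_b_loc_b.png", 900.0),
            ],
        ),
        Title("Trip C", "trip_c.png"),
    ]


def find_entry_point(summary: List[Title], title: str, chapter: str) -> float:
    """Return the start time of the chapter the user selected."""
    for t in summary:
        if t.label == title:
            for c in t.chapters:
                if c.label == chapter:
                    return c.start_seconds
    raise KeyError(f"{title!r} / {chapter!r} not found")


if __name__ == "__main__":
    summary = build_example_summary()
    print([t.label for t in summary])
    print(find_entry_point(summary, "Trip B", "Location B"))
```

Selecting a title corresponds to listing its `chapters`, and selecting a chapter corresponds to seeking the video source to the returned entry point.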
The ability to automatically create, from a video sequence, a title-and-chapter structure with a meaningful representative image or keyframe 16 and a corresponding text label 18 is of great interest in DVD authoring. Applications that automatically select a representative image from a video document are known in the art. For example, in Hewlett Packard's MyDVD application, when a user elects to have a DVD created automatically from a video, a new chapter is created each time a scene detection algorithm detects a scene. A keyframe 16 is then extracted from each detected scene. The keyframe 16, which represents the underlying scene, is linked to a DVD navigation button so that the user can browse the keyframes 16 to quickly grasp the content of the video sequence and click the relevant button to watch the corresponding scene.
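The general idea of creating a chapter per detected scene and extracting a keyframe from each can be sketched as follows. This is not the algorithm of any particular product, which the text does not specify. It substitutes a common textbook technique (thresholding the difference between gray-level histograms of consecutive frames) and, as a further assumption, takes the middle frame of each scene as its keyframe. Frames are modeled as flat lists of 8-bit gray values so that the example has no external dependencies.

```python
from typing import List, Sequence, Tuple

Frame = Sequence[int]  # flat list of 8-bit grayscale pixels (assumption)


def histogram(frame: Frame, bins: int = 8) -> List[float]:
    """Return a normalized gray-level histogram of one frame."""
    counts = [0] * bins
    for pixel in frame:
        counts[min(pixel * bins // 256, bins - 1)] += 1
    total = len(frame) or 1
    return [c / total for c in counts]


def detect_scenes(
    frames: Sequence[Frame], threshold: float = 0.5
) -> List[Tuple[int, int]]:
    """Return (first, last) frame-index pairs, one per detected scene."""
    if not frames:
        return []
    scenes = []
    start = 0
    previous = histogram(frames[0])
    for index in range(1, len(frames)):
        current = histogram(frames[index])
        # L1 distance between normalized histograms lies in [0, 2].
        distance = sum(abs(a - b) for a, b in zip(previous, current))
        if distance > threshold:
            scenes.append((start, index - 1))  # close the current scene
            start = index                      # a new chapter begins here
        previous = current
    scenes.append((start, len(frames) - 1))
    return scenes


def pick_keyframes(scenes: List[Tuple[int, int]]) -> List[int]:
    """Choose the middle frame of each scene as its keyframe (assumption)."""
    return [(first + last) // 2 for first, last in scenes]


if __name__ == "__main__":
    dark, bright = [10] * 16, [240] * 16  # synthetic 4x4 frames
    frames = [dark] * 3 + [bright] * 4
    scenes = detect_scenes(frames)
    print(scenes, pick_keyframes(scenes))
```

The threshold of 0.5 is arbitrary; in practice it would need tuning, and gradual transitions such as fades would require a more robust detector than this frame-to-frame comparison.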
Presently, labels 18 for annotating selected keyframes 16 are created either manually or from video analysis. Manual creation is both time consuming and subjective, while creation from video analysis tends to be unreliable due to limitations in the video summarization algorithms. Either approach also slows the video summarization process, whether through manual label creation or through extensive video analysis. Thus, a need still exists for automatically creating a meaningful text label 18 to accompany the selected keyframe 16 images in a video summary.