Digital video is a rapidly growing element of the computer and telecommunication industries. Many companies, universities and even families already have large repositories of videos in both analog and digital formats. Examples include video used in broadcast news, training and education videos, security monitoring videos, and home videos. The fast evolution of digital video is changing the way many people capture and interact with multimedia, and in the process it has created many new needs and applications.
Consequently, there is a strong need for research and development of new technologies that lower the costs of video archiving, cataloging and indexing, and that improve the efficiency, usability and accessibility of stored videos. One important topic is how to enable a user to quickly browse a large collection of video data while achieving efficient access to, and representation of, the video content. To address these issues, video abstraction techniques have emerged and have attracted growing research interest in recent years.
A video abstract, as the name implies, is a short summary of the content of a longer video document. It provides users with concise information about the content of the video document while preserving the essential message of the original. In principle, a video abstract can be generated manually or automatically. However, due to the huge volumes of video data already in existence and the ever increasing amount of new video data being created, it is increasingly difficult to generate video abstracts manually. Thus, it is becoming more and more important to develop fully automated video analysis and processing tools so as to reduce human involvement in the video abstraction process.
There are two fundamentally different kinds of video abstracts: still-image abstracts and moving-image abstracts. The still-image abstract, also called a video summary, is a small collection of salient images (known as keyframes) extracted or generated from the underlying video source. The moving-image abstract, also called a video skim, consists of a collection of image sequences, together with the corresponding audio abstract extracted from the original sequence, and is thus itself a video clip, but of considerably shorter length. Generally, a video summary can be built much faster than a video skim, since only visual information is used and no handling of audio or textual information is necessary. A video summary can also be displayed more easily, since there are no timing or synchronization issues. Furthermore, the extracted representative frames can be laid out spatially in their temporal order so that users are able to grasp the video content more quickly. Finally, when needed, all extracted still images in a video summary may be printed out very easily.
As a general approach to video summarization, the entire video sequence is often first segmented into a series of shots; then one or more keyframes are extracted from each shot, either by uniform sampling or by adaptive schemes that depend on the complexity of the underlying video content, as measured by a variety of features including color and motion. A typical output of these systems is a static storyboard with all extracted keyframes displayed in their temporal order. There are two major drawbacks in these approaches. First, while these efforts attempt to reduce the amount of data, they often only present the video content “as is” rather than summarizing it. Since different shots may be of different importance to users, it is preferable to assign more keyframes to important shots than to less important ones. Second, a static storyboard cannot give users a scalable video summary, which is a useful feature in a practical summarization system. For example, sometimes the user may want to take a detailed look at certain scenes or shots, which requires more keyframes, and sometimes the user may only need a very coarse summarization, which requires fewer keyframes.
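The general approach described above (shot segmentation followed by uniform keyframe sampling) can be illustrated with a minimal sketch. It is not taken from any particular prior system. It assumes each frame is a non-empty flattened list of 8-bit grayscale pixel values, detects a shot boundary wherever the intensity-histogram difference between consecutive frames exceeds a threshold, and then samples keyframes evenly within each shot. The function names, the bin count, and the threshold value are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple


def histogram(frame: Sequence[int], bins: int = 8, max_value: int = 256) -> List[float]:
    """Normalized intensity histogram of a non-empty, flattened grayscale frame."""
    counts = [0] * bins
    for pixel in frame:
        counts[pixel * bins // max_value] += 1
    total = len(frame)
    return [c / total for c in counts]


def segment_shots(frames: Sequence[Sequence[int]],
                  threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Split a frame sequence into shots, returned as (start, end) with end exclusive.

    A boundary is declared when the L1 distance between the histograms of two
    consecutive frames exceeds `threshold` (an assumed, illustrative value).
    """
    if not frames:
        return []
    shots = []
    start = 0
    prev = histogram(frames[0])
    for i in range(1, len(frames)):
        cur = histogram(frames[i])
        diff = sum(abs(a - b) for a, b in zip(prev, cur))
        if diff > threshold:
            shots.append((start, i))
            start = i
        prev = cur
    shots.append((start, len(frames)))
    return shots


def uniform_keyframes(shot: Tuple[int, int], count: int = 1) -> List[int]:
    """Pick `count` frame indices evenly spaced inside one shot (uniform sampling)."""
    start, end = shot
    length = end - start
    count = max(1, min(count, length))
    # Take the midpoint of each of `count` equal-length sub-intervals.
    return [start + (2 * k + 1) * length // (2 * count) for k in range(count)]


if __name__ == "__main__":
    # Synthetic input: four dark frames followed by six bright frames.
    video = [[10] * 16 for _ in range(4)] + [[200] * 16 for _ in range(6)]
    shots = segment_shots(video)
    storyboard = [uniform_keyframes(shot, count=1) for shot in shots]
    print(shots, storyboard)
```

Because every shot receives the same fixed number of keyframes regardless of its importance, and the resulting storyboard has a single fixed level of detail, the sketch exhibits both drawbacks noted above.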
What is needed is a system and method to automatically and intelligently generate a scalable video summary of a video document that offers users the flexibility to summarize and navigate the video content to their own desired level of detail.