With rapid technological advances in digital television, multimedia, and the Internet, the amount of digital image, audio, and video data available to consumers has recently increased. Among all media types, video is frequently characterized as the most challenging to manipulate, as it combines all other media information into a single data stream. In recent years, due in part to the decreasing cost of storage devices, higher transmission rates, and improved compression techniques, digital videos have become available at an ever-increasing rate. Thanks to the increasing availability of computing resources and the popularity of so-called Web 2.0 technologies, a growing number of user-centric applications allow ordinary people to record, edit, deliver, and publish their own home-made digital videos on social websites or networks (e.g., YouTube). As a result, interaction with videos has become an important part of the daily lives of modern individuals, and many related applications have emerged.
Currently, as a key element of multimedia computing, digital video is widely employed in many industries and in various systems. However, because videos tend to be long and unstructured, efficient access to video, especially content-based access, is difficult to achieve. In other words, the increasing availability of digital video has not been accompanied by an increase in its accessibility. The abundance of video data makes it increasingly difficult for users to efficiently manage and navigate their video collections. Therefore, a need has arisen for efficient and effective automated techniques that allow users to navigate and analyze video content.
The field of video summarization aims to organize video data into a compact form and to extract meaningful information from that video data. In general, current video summarization technologies can be categorized into two main types: static video summarization and dynamic video summarization.
Static video summarization generally refers to segmenting a whole video stream into several partitions (i.e., video shots). For each segment or shot, one or more frames are extracted as key frames. The resulting key frames are then arranged sequentially or hierarchically to form the summary. Various static video summarization techniques are described in the article “A novel video summarization based on mining the story-structure and semantic relations among concept entities” (IEEE Transactions Multimedia, vol. 11, No. 2, pp. 295-312, 2009) and the article “Hierarchical video summarization and content description joint semantic and visual similarity” (ACM Multimedia System, vol. 9, No. 1, 2003).
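By way of illustration only, the general idea of static summarization (segmenting a stream into shots and extracting a key frame from each) can be sketched as follows. This is a minimal sketch and not the method of either cited article. It assumes that each frame is a non-empty flat sequence of 8-bit grayscale pixel values, that a shot boundary is declared when the histogram difference between consecutive frames exceeds a threshold, and that the middle frame of each shot serves as its key frame. The function names, the bin count, and the threshold value are all hypothetical.

```python
from typing import List, Sequence, Tuple


def histogram(frame: Sequence[int], bins: int = 8, max_value: int = 256) -> List[float]:
    """Normalized intensity histogram of one frame (assumed non-empty)."""
    counts = [0] * bins
    for pixel in frame:
        counts[pixel * bins // max_value] += 1
    total = len(frame)
    return [c / total for c in counts]


def segment_into_shots(
    frames: Sequence[Sequence[int]], threshold: float = 0.5
) -> List[Tuple[int, int]]:
    """Split frames into shots, returned as (start, end) pairs with end exclusive.

    Assumption: a boundary is declared wherever the L1 distance between the
    histograms of consecutive frames exceeds `threshold` (distance is in [0, 2]).
    """
    if not frames:
        return []
    boundaries = [0]
    previous = histogram(frames[0])
    for index in range(1, len(frames)):
        current = histogram(frames[index])
        difference = sum(abs(a - b) for a, b in zip(previous, current))
        if difference > threshold:
            boundaries.append(index)
        previous = current
    boundaries.append(len(frames))
    return list(zip(boundaries[:-1], boundaries[1:]))


def key_frames(shots: Sequence[Tuple[int, int]]) -> List[int]:
    """Pick the middle frame index of each shot as its key frame (an assumption)."""
    return [(start + end - 1) // 2 for start, end in shots]


# Three synthetic "shots": dark frames, bright frames, mid-gray frames.
frames = [[10] * 16] * 3 + [[200] * 16] * 4 + [[100] * 16] * 2
shots = segment_into_shots(frames)
summary = key_frames(shots)
```

In this sketch the summary is simply the sequential list of key-frame indices, one per shot, so its length grows with the number of shots in the video.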
Although static video summarization can offer users a comprehensive view of a video by generating a concise and informative visual abstract of its content, it is susceptible to a smoothness problem: users may feel uncomfortable browsing the results because the key-frame sequence lacks smoothness. For example, for a video of long duration, the above static methods commonly generate thousands of key frames. In the feature-length movie Terminator 2: Judgment Day, for instance, a single 15-minute segment contains 300 shots, and the movie lasts 139 minutes; if that rate of 20 shots per minute held throughout, the full movie would contain roughly 2,780 shots, each contributing one or more key frames. A sequential layout of static key frames for such a complex feature-length video may thus be meaningless for users' semantic understanding of the video content.
Dynamic video summarization is an alternative approach that generates so-called video skims (temporally continuous segments) from an original video stream. In one example of such dynamic video summarization, a hidden Markov model (HMM) was used to generate a video skim, as described in the article by S. Benini et al. (Hidden Markov models for video skim generation, Proc. of 8th International Workshop on Image Analysis for Multimedia Interactive Services, June 2007). A video skim method that considers different features (audio, visual, and text) together was proposed in the article “Video skimming and characterization through the combination of image and language understanding” (Proc. of IEEE International Workshop on Content-based Access Image Video Data Base, pp. 61-67, January 1998). The authors of the article “A user attention model for video summarization” (Proc. of 10th ACM Multimedia, pp. 533-542, December 2002) attempted to create video skims using attention models.
In general, the high computational complexity of such dynamic video summarization techniques makes them infeasible in practice. For example, the above HMM-based method must first estimate the model parameters before the model can be applied to create video skims. In current video players, the uniform fast-forward mode is still the only way for users to navigate video rapidly. Traditional fast-forward is a sampling procedure that plays and skips video frames uniformly. However, such uniform sampling may not effectively capture the semantic information of the video data.
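The uniform sampling performed by traditional fast-forward can be illustrated with the following minimal sketch. It assumes a fixed pattern in which a given number of frames is played and then a given number is skipped, repeated over the whole video. The function name and the `play` and `skip` parameters are hypothetical and are not taken from any particular video player.

```python
from typing import List


def uniform_fast_forward(num_frames: int, play: int = 1, skip: int = 3) -> List[int]:
    """Return the indices of the frames shown by a uniform fast-forward.

    Assumption: the player repeatedly shows `play` frames and then drops
    `skip` frames, regardless of what the frames contain.
    """
    if num_frames < 0 or play < 1 or skip < 0:
        raise ValueError("num_frames >= 0, play >= 1 and skip >= 0 are required")
    period = play + skip
    return [index for index in range(num_frames) if index % period < play]


# A 4x fast-forward over a 10-frame clip shows only frames 0, 4 and 8.
shown = uniform_fast_forward(10, play=1, skip=3)
```

Because the pattern depends only on frame position, an event confined to frames 1 through 3 of this clip would never be displayed, however important it is to the content.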
In addition, most existing summarization methods are video shot-based. However, analysis based on the physical structure of a video, such as its shots, is not directly related to semantic understanding of the video content.
Therefore, a method is needed that effectively captures the important video content ignored by the traditional fast-forward mode and makes content-based rapid video navigation feasible in practice.
There is a need to overcome the disadvantages described above. There is also a need to provide improvements to known video summarization techniques.