The proliferation of digital cameras has led to an explosion in the number of digital videos captured and stored in consumer collections. This has created a demand for automated tools for efficiently browsing, searching, and utilizing videos in large personal video collections.
Video summarization is a mechanism to produce a condensed or summarized version of an original video sequence by analyzing the underlying content in the entire video stream. It is an important tool for facilitating video browsing and search, and has been extensively explored in the prior art. A wide variety of types of information have been utilized in video summarization processes, including text descriptions, visual appearances, and audio sounds. A relatively comprehensive survey can be found in the article by Money et al., entitled “Video summarisation: A conceptual framework and survey of the state of the art” (Journal of Visual Communication and Image Representation, Vol. 19, pp. 121-143, 2008).
Most previous video summarization work has been designed to process videos with a high quality level (e.g., videos having a relatively high resolution, stable camera position, and low background noise in both audio and visual signals). Specifically, such work has mainly focused upon certain professional video genres such as sports, news, TV drama, or movie dialog. As yet, little work has been done to provide methods that are well-suited for use with consumer-quality videos, which are typically captured under uncontrolled conditions and have diverse content and quality characteristics.
One major reason that research on consumer video summarization is lacking is the difficulty of content analysis in consumer-quality videos. First, in contrast to videos from sporting events or television dramas, there is typically a lack of specific domain knowledge to guide video summarization systems due to the diverse video content characteristics.
Second, a consumer video typically consists of one long shot, with challenging conditions such as uneven illumination, clutter, occlusions, and complicated motions of objects and the camera. Additionally, the audio soundtrack typically includes multiple sound sources in the presence of high levels of background noise. As a result, it is difficult to identify specific objects or events from the video sequences, and it is hard to identify semantically meaningful audio segments. Consequently, methods that rely upon object/event detection or special sound effect detection cannot be easily applied to consumer video sequences.
Some prior art non-domain specific video summarization methods rely on an accurate knowledge of object/camera motion. Since it is difficult to accurately assess this information from consumer video sequences, such methods generally do not perform well either.
Another barrier to the development of video summarization methods for use with consumer videos is the difficulty in assessing a user's satisfaction with the generated video summaries. Previous studies, such as that described by Forlines et al. in the article entitled “Subjective assessment of consumer video summarization” (SPIE Conf. Multimedia Content Analysis, Management and Retrieval, Vol. 6073, pp. 170-177, 2006), show that due to the subjective nature of the problem, the actual consumer needs can only be determined from in-depth user studies.
There remains a need for a robust video summarization method that can be applied to consumer video sequences.