1. Field of the Invention
This invention relates to techniques for video summarization based on singular value decomposition (SVD). The present invention also relates to providing tools for effective searching and retrieval of video sequences according to user-specified queries. In particular, the invention relates to segmentation of video sequences into shots for automated searching, indexing, and access. Finally, this invention relates to a method for extracting features and metadata from video shots to enable classification, search, and retrieval of the video shots.
2. Description of the Related Art
The widespread distribution of video information in computer systems and networks has presented both excitement and challenge. Video is exciting because it conveys real-world scenes most vividly and faithfully. On the other hand, handling video is challenging because video images are represented by voluminous, redundant, and unstructured data streams which span the time sequence. In many instances, it can be a painful task to locate either the appropriate video sequence or the desired portion of the video information from a large video data collection. The situation becomes even worse on the Internet. To date, increasing numbers of websites offer video images for news broadcasting, entertainment, or product promotion. However, with very limited network bandwidth available to most home users, people spend minutes or tens of minutes downloading voluminous video images, only to find them irrelevant.
Important aspects of managing a large video data collection are providing a user with a quick summary of the content of video footage and enabling the user to quickly browse through extensive video resources. Accordingly, to turn unstructured, voluminous video images into exciting, valuable information resources, browsing and summarization tools that would allow the user to quickly get an idea of the overall content of video footage become indispensable.
Currently, most video browsing tools use a set of keyframes to provide content summary of a video sequence. Many systems use a constant number of keyframes for each detected scene shot, while others assign more keyframes to scene shots with more changes. There are also systems that remove redundancies among keyframes by clustering the keyframes based on their visual similarity. An important missing component in existing video browsing and summarization tools is a mechanism to estimate how many keyframes would be sufficient to provide a good, nonredundant representation of a video sequence.
Simple methods that assign a fixed number of keyframes to each scene shot suffer from poor video content representation, while more sophisticated approaches that adaptively assign keyframes according to the activity levels often rely on the user to provide either the number of keyframes to be generated, or some threshold values (e.g., the similarity distance or the time interval between keyframes), which are used to generate the keyframes. Accordingly, the user must go through several rounds of interactions with the system to obtain an appropriate set of keyframes. This approach is acceptable when the user browses a small set of video images disposed on a local workstation. On the other hand, the approach becomes prohibitive when video images located on the Internet are accessed through a network with very limited bandwidth, or when a video summary must be created for each video image in a large-scale video database.
As mentioned above, existing video browsing and content overview tools utilize keyframes extracted from original video sequences. Many works concentrate on breaking video into shots, and then finding a fixed number of keyframes for each detected shot. For example, Tonomura et al. used the first frame from each shot as a keyframe, see Y. Tonomura, A. Akutsu, K. Otsuji, and T. Sadakata, “Videomap and videospaceicon: Tools for anatomizing video content,” in Proc. ACM INTERCHI'93, 1993. Ueda et al. represented each shot by using its first and last frames, see H. Ueda, T. Miyatake, and S. Yoshizawa, “Impact: An interactive natural-motion-picture dedicated multimedia authoring system,” in Proc. ACM SIGCHI'91, (New Orleans), April 1991. Ferman and Tekalp clustered the frames in each shot, and selected the frame closest to the center of the largest cluster as the keyframe, see A. Ferman and A. Tekalp, “Multiscale content extraction and representation for video indexing,” in Proc. SPIE 3229 on Multimedia Storage and Archiving Systems II, 1997.
An obvious disadvantage of the above equal-density-keyframe assignment is that long shots, which involve camera pans and zooms as well as object motion, will not be adequately represented. To address this problem, DeMenthon et al. proposed to assign a variable number of keyframes according to the activity level of the corresponding scene shot, see D. DeMenthon, V. Kobla, and D. Doermann, “Video summarization by curve simplification,” Tech. Rep. LAMP-TR-018, Language and Media Processing Laboratory, University of Maryland, 1998. The described method represents a video sequence as a trajectory curve in a high dimensional feature space, and uses a recursive binary curve splitting algorithm to find a set of perceptually significant points, which can be used in approximating the video curve. The curve approximation is repeated until the approximation error falls below the user-specified value. Frames corresponding to these perceptually significant points are then used as keyframes to summarize the video contents. Because the curve splitting algorithm assigns more points to segments with larger curvature, this method naturally assigns more keyframes to shots with more variations.
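The curve-splitting idea can be illustrated with a minimal sketch. It is not the algorithm of the cited report; it is a Douglas-Peucker-style simplification offered under stated assumptions: each frame is already reduced to a numeric feature vector, the approximation error is the Euclidean distance from a point to the chord joining the two ends of the current segment, and the names `split_curve` and `tolerance` are hypothetical.

```python
import math


def _dist_to_segment(p, a, b):
    """Euclidean distance from point p to the segment a-b (any dimension)."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ap = [pi - ai for ai, pi in zip(a, p)]
    denom = sum(x * x for x in ab)
    if denom == 0:
        t = 0.0  # degenerate segment: a and b coincide
    else:
        # Projection parameter of p onto a-b, clamped to the segment.
        t = max(0.0, min(1.0, sum(x * y for x, y in zip(ap, ab)) / denom))
    return math.sqrt(sum((x - t * y) ** 2 for x, y in zip(ap, ab)))


def split_curve(features, tolerance):
    """Return sorted indices of frames kept as keyframe candidates.

    features:  list of equal-length feature vectors, one per frame
               (assumed input; feature extraction is not shown).
    tolerance: user-specified approximation error (assumed parameter).
    """
    def recurse(lo, hi):
        if hi - lo < 2:
            return []
        worst, worst_d = None, tolerance
        for i in range(lo + 1, hi):
            d = _dist_to_segment(features[i], features[lo], features[hi])
            if d > worst_d:
                worst, worst_d = i, d
        if worst is None:
            return []  # every point is within tolerance of the chord
        # Split at the farthest point and simplify both halves.
        return recurse(lo, worst) + [worst] + recurse(worst, hi)

    if not features:
        return []
    last = len(features) - 1
    if last == 0:
        return [0]
    return [0] + recurse(0, last) + [last]


# Toy 2-D trajectory with one sharp excursion at frame 3.
curve = [(0, 0), (1, 0), (2, 0), (3, 5), (4, 0), (5, 0), (6, 0)]
keyframes = split_curve(curve, tolerance=2.0)
```

In this sketch a smaller `tolerance` keeps more points around the excursion, which mirrors the behavior described above: segments with larger curvature receive more keyframes, but the user must still choose the error threshold.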
Keyframes extracted from a video sequence may contain duplications and redundancies. For example, in a TV program with two people talking, the video camera usually switches back and forth between the two persons, and inserts some global views of a scene. Applying the above keyframe selection methods to this video sequence will generate many keyframes that are almost identical. To remove redundancies from the produced keyframes, Yeung et al. selected one keyframe from each video shot, performed hierarchical clustering on these keyframes based on their visual similarity and temporal distance, and then retained only one keyframe for each cluster, see M. Yeung, B. Yeo, W. Wolf, and B. Liu, “Video browsing using clustering and scene transitions on compressed sequences,” in Proc. SPIE on Multimedia Computing and Networking, vol. 2417, 1995. Girgensohn and Boreczky also applied the hierarchical clustering technique to group the keyframes into as many clusters as specified by the user. For each cluster, a single keyframe is selected such that the constraints dictated by the requirement of an even distribution of keyframes over the length of the video and a minimum distance between keyframes are met, see A. Girgensohn and J. Boreczky, “Time-constrained keyframe selection technique,” in Proc. IEEE Multimedia Computing and Systems (ICMCS'99), 1999.
To create a concise summary of video contents, it is very important to ensure that the summarized representation of the original video (1) contains little redundancy, and (2) gives equal attention to the same amount of contents. While some of the sophisticated keyframe selection methods address these two issues to varying extents, they often rely on the user to provide either the number of keyframes to be generated, or some thresholds (e.g., a similarity distance between keyframes or approximation errors), which are used in keyframe generation. Accordingly, an optimal set of keyframes can be produced only after several rounds of trials. On the other hand, excessive trials could become prohibitive when video images are accessed through a connection with limited bandwidth, or when a keyframe-set must be created for each video image in a large-scale video database.
Apart from the above problems of keyframe selection, summarizing video contents using keyframes has its own limitations. A video image is a continuous recording of a real-world scene. A set of static keyframes by no means captures the dynamics and the continuity of the video image. For example, in viewing a movie or a TV program, the user may well prefer a summarized motion video with a specified time length to a set of static keyframes.
A second important aspect of managing video data is providing tools for efficient searching and retrieval of video sequences according to user-specified queries. It can be a painful task to find either an appropriate video sequence, or the desired portions of the video hidden within a large video data collection. Traditional text indexing and retrieval techniques have turned out to be powerless in indexing and searching video images. To tap into the rich and valuable video resources, video images must be transformed into a medium that is structured, manageable and searchable.
The initial steps toward the aforementioned goal include the segmentation of video sequences into shots for indexing and access, and the extraction of features/metadata from the shots to enable their classification, search, and retrieval. For video shot segmentation, a number of methods have been proposed in past years. Typical video shot segmentation methods include shot segmentation using pixel values, described in K. Otsuji, Y. Tonomura, and Y. Ohba, “Video browsing using brightness data,” in SPIE Proc. Visual Communications and Image Processing, (Boston), pp. 980–989, 1991, and A. Hampapur, R. Jain, and T. Weymouth, “Digital video segmentation,” in Proceedings of ACM Multimedia 94, (San Francisco), October 1994. Another video segmentation method, described in H. Ueda, T. Miyatake, and S. Yoshizawa, “Impact: An interactive natural-motion-picture dedicated multimedia authoring system,” in Proc. ACM SIGCHI'91, (New Orleans), April 1991, relies on global or local histograms. The use of motion vectors in video segmentation is described in H. Ueda, et al., see above. Discrete cosine transform (DCT) coefficients from MPEG files can also be used for video shot segmentation, see F. Arman, A. Hsu, and M. Y. Chiu, “Image processing on encoded video sequences,” Multimedia Systems, vol. 1, no. 5, pp. 211–219, 1994.
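As an illustration of histogram-based segmentation by frame-pair comparison, the following minimal sketch flags an abrupt shot boundary whenever the global gray-level histograms of two consecutive frames differ by more than a threshold. It does not reproduce any of the cited methods. The frame representation (a flat list of 8-bit gray values, all frames the same size), the bin count, the threshold, and the names `gray_histogram` and `detect_cuts` are assumptions made for the example.

```python
def gray_histogram(frame, bins=16):
    """Global histogram of a frame given as a flat list of 0-255 gray values."""
    hist = [0] * bins
    for v in frame:
        hist[min(v * bins // 256, bins - 1)] += 1
    return hist


def detect_cuts(frames, threshold=0.5, bins=16):
    """Return indices of frames that start a new shot.

    The difference between consecutive histograms is the sum of absolute
    bin differences, normalized to [0, 1] by twice the pixel count.
    Assumes every frame is non-empty and has the same number of pixels.
    """
    cuts = []
    prev = None
    for idx, frame in enumerate(frames):
        hist = gray_histogram(frame, bins)
        if prev is not None:
            diff = sum(abs(a - b) for a, b in zip(hist, prev)) / (2 * len(frame))
            if diff > threshold:
                cuts.append(idx)
        prev = hist
    return cuts


# Toy sequence: two dark frames, two bright frames, one dark frame.
dark = [10] * 64
bright = [240] * 64
cuts = detect_cuts([dark, dark, bright, bright, dark])
```

Because only adjacent frames are compared against a fixed threshold, a sketch of this kind responds to abrupt cuts but not to gradual transitions such as fades or dissolves, where the frame-to-frame difference stays small.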
Apart from the aforementioned methods, many other video segmentation techniques have been developed recently. While the vast majority of video segmentation methods use a simple approach of frame-pair comparisons and can detect only abrupt shot boundaries, some more sophisticated segmentation techniques use additional frames in the aforementioned comparison operation to provide for the detection of gradual scene changes, see H. Zhang, A. Kankanhalli, and S. Smoliar, “Automatic partitioning of full-motion video,” Multimedia Systems, vol. 1, pp. 10–28, 1993. As for video shot retrieval and classification, the most common approach to date has been to first carry out the video shot segmentation, perform additional operations to extract features from each detected shot, and then create indices and metrics using the extracted features to accomplish shot retrieval and classification. In systems based on this described approach, several of the aforementioned processing steps must be performed simultaneously. As a result, these systems usually suffer from high computational costs and long processing times.
Accordingly, there is a recognized need for, and it would be advantageous to have, an improved technique that automatically creates an optimal and non-redundant summarization of an input video sequence, and that supports different user requirements for video browsing and content overview by outputting either the optimal set of keyframes, or a summarized version of the original video with the user-specified time length.
There is also a demand for, and it would be advantageous to have, an improved technique for segmenting video sequences into shots for indexing and access, and for extracting features/metadata from the segmented shots to enable their classification, search, and retrieval.