Among all the sources of video content, unstructured consumer video probably constitutes the content that most people are, or eventually will be, interested in dealing with. Organizing and editing personal memories by accessing and manipulating home videos represents a natural technological extension of traditional still picture organization. However, although such efforts have become attractive with the advent of digital video, they remain limited by the size of these visual archives and by the lack of efficient tools for accessing, organizing, and manipulating home video information. The creation of such tools would also open the door to the organization of video events in albums, video baby books, editions of postcards with stills extracted from video data, multimedia family web pages, etc. In fact, the variety of user interests suggests an interactive solution, one that requires a minimum amount of user feedback to specify the desired tasks at the semantic level and that provides automated algorithms for those tasks that are tedious or can be performed reliably.
In commercial video, many moving image documents have story structures which are reflected in the visual content. In such situations, a complete moving image document is referred to as a video clip. The fundamental unit of the production of video is the shot, which captures continuous action. The identification of video shots is achieved by scene change detection schemes which give the start and end of each shot. A scene is usually composed of a small number of interrelated shots that are unified by location or dramatic incident. Feature films are typically composed of a number of scenes, which define a storyline for understanding the content of the moving image document.
In contrast with commercial video, unrestricted content and the absence of a storyline are the main characteristics of home video. Consumer content is usually composed of a set of events, either isolated or related, each composed of one or a few shots, randomly spread over time. Such characteristics make consumer video unsuitable for video analysis approaches based on storyline models. However, there still exists a spatio-temporal structure, based on visual similarity and temporal adjacency between video segments (sets of shots), that becomes evident after a statistical analysis of a large home video database. Such structure, essentially equivalent to the structure of consumer still images, points towards addressing home video structuring as a clustering problem. The task at hand could be defined as the determination of the number of clusters present in a given video clip, and the design of an optimality criterion for assigning a cluster label to each frame or shot in the video sequence. This has indeed been the direction taken by most research in video analysis, even when dealing with storylined content.
For example, U.S. Pat. No. 5,821,945 describes a technique for extracting a hierarchical decomposition of a complex video selection for browsing purposes, combining visual and temporal information to capture the important relations within a scene and between scenes in a video. This is said to allow the analysis of the underlying story structure with no a priori knowledge of the content. Such approaches perform video structuring in variations of a two-stage methodology: video shot boundary detection (shot segmentation), followed by shot clustering. The first stage is by far the most studied in video analysis (see, e.g., U. Gargi, R. Kasturi and S. H. Strayer, “Performance Characterization of Video-Shot-Change Detection Methods”, IEEE CSVT, Vol. 10, No. 1, February 2000, pp. 1-13). For the second stage, using shots as the fundamental unit of video structure, K-means, distribution-based clustering, and time-constrained merging techniques have all been disclosed in the prior art. Some of these methods require the setting of a number of parameters, which are either application-dependent or empirically determined by user feedback.
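By way of illustration only, the two-stage methodology described above can be sketched as follows. This is a minimal sketch and not the method of any cited reference: it assumes each frame is already summarized by a normalized color histogram, it uses a simple L1 histogram difference for shot boundary detection, and it uses a greedy time-constrained assignment for shot clustering. The names `detect_shots`, `cluster_shots`, `cut_threshold`, `merge_threshold`, and `max_gap` are hypothetical.

```python
from typing import List, Tuple

Histogram = List[float]   # assumed: one normalized color histogram per frame
Shot = Tuple[int, int]    # (start, end) frame indices, end exclusive


def l1_distance(a: Histogram, b: Histogram) -> float:
    """Sum of absolute bin differences between two histograms."""
    return sum(abs(x - y) for x, y in zip(a, b))


def detect_shots(frames: List[Histogram], cut_threshold: float) -> List[Shot]:
    """Stage 1: declare a shot boundary wherever consecutive frames differ
    by more than cut_threshold (an empirically chosen parameter)."""
    if not frames:
        return []
    shots, start = [], 0
    for i in range(1, len(frames)):
        if l1_distance(frames[i - 1], frames[i]) > cut_threshold:
            shots.append((start, i))
            start = i
    shots.append((start, len(frames)))
    return shots


def mean_histogram(frames: List[Histogram], shot: Shot) -> Histogram:
    """Average the frame histograms of one shot, bin by bin."""
    start, end = shot
    count = end - start
    return [sum(frames[i][b] for i in range(start, end)) / count
            for b in range(len(frames[start]))]


def cluster_shots(frames: List[Histogram], shots: List[Shot],
                  merge_threshold: float, max_gap: int) -> List[int]:
    """Stage 2: give each shot the label of its most similar predecessor
    among the previous max_gap shots, or a new label if none is close."""
    means = [mean_histogram(frames, s) for s in shots]
    labels: List[int] = []
    next_label = 0
    for k in range(len(shots)):
        best_j, best_d = None, merge_threshold
        for j in range(max(0, k - max_gap), k):
            d = l1_distance(means[j], means[k])
            if d <= best_d:
                best_j, best_d = j, d
        if best_j is None:
            labels.append(next_label)
            next_label += 1
        else:
            labels.append(labels[best_j])
    return labels
```

Note that both stages depend on thresholds (`cut_threshold`, `merge_threshold`) and a time window (`max_gap`) that must be chosen by hand, which is the kind of parameter setting referred to above.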
As understood in the prior art, hierarchical representations seem to be not only a natural way to represent unstructured content, but probably also the best way of providing useful non-linear interaction models for browsing and manipulation. Fortunately, as a byproduct, clustering allows for the generation of hierarchical representations of video content. Different models for hierarchical organization have also been proposed in the prior art, including scene transition graphs (e.g., see the aforementioned U.S. Pat. No. 5,821,945) and tables of contents based on trees, although the efficiency and usability of each specific model remains, in general, an open issue.
To date, only a few works have dealt with the analysis of home video (e.g., see G. Iyengar and A. Lippman, “Content-based Browsing and Edition of Unstructured Video”, IEEE ICME, New York City, August 2000; R. Lienhart, “Abstracting Home Video Automatically”, ACM Multimedia Conference, Orlando, October, 1999, pp. 37-41; and Y. Rui and T. S. Huang, “A Unified Framework for Video Browsing and Retrieval”, in A. C. Bovik, Ed., Handbook of Image and Video Processing, Academic Press, 1999). The work in the Lienhart article uses time-stamp information to perform clustering for the generation of video summaries. Time-stamp information, however, might not always be available. Even though digital cameras include this information, users do not always use the time option. Therefore, a general solution cannot rely on this information. The work in the Rui and Huang article on the generation of tables of contents, based on very simple statistical assumptions, was tested on some home videos with a “storyline”. However, the highly unstructured nature of home video makes the application of specific storyline models quite limited. With the exception of the Iyengar and Lippman article, none of the previous approaches has analyzed in detail the inherent statistics of such content. From this point of view, the present invention is more closely related to the work in N. Vasconcelos and A. Lippman, “A Bayesian Video Modeling Framework for Shot Segmentation and Content Characterization”, Proc. CVPR, 1997, which proposes a Bayesian formulation for shot boundary detection based on statistical models of shot duration, and to the work in the Iyengar and Lippman article, which addresses home video analysis using a different probabilistic formulation.
Nonetheless, it is unclear from the prior art whether a probabilistic methodology that uses video shots as the unit of organization could support the creation of a video hierarchy for interaction. In arriving at the present invention, statistical models of visual and temporal features in consumer video have been investigated for organization purposes. In particular, a Bayesian formulation seemed appealing as a way to encode prior knowledge of the spatio-temporal structure of home video. In a departure from the prior art, the inventive approach described herein is based on an efficient probabilistic video segment merging algorithm that integrates inter-segment features of visual similarity, temporal adjacency, and duration in a joint model, allowing for the generation of video clusters without empirical parameter determination.
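By way of illustration only, a Bayesian decision on whether to merge two video segments might be sketched as follows. This is a simplified sketch under stated assumptions and is not the joint model of the invention: it assumes that the three inter-segment features (visual dissimilarity, temporal separation, and combined duration) are conditionally independent and Gaussian given each hypothesis, and all names and numeric values (`Gaussian`, `ClassModel`, `should_merge`, and the parameters in the usage example) are hypothetical placeholders.

```python
import math
from dataclasses import dataclass
from typing import Tuple

# (visual dissimilarity, temporal separation, combined duration)
Features = Tuple[float, float, float]


@dataclass
class Gaussian:
    mean: float
    std: float

    def log_pdf(self, x: float) -> float:
        """Log density of a univariate normal distribution at x."""
        z = (x - self.mean) / self.std
        return -0.5 * z * z - math.log(self.std) - 0.5 * math.log(2.0 * math.pi)


@dataclass
class ClassModel:
    """Prior and per-feature likelihoods for one hypothesis
    (segments belong together, or segments are distinct)."""
    prior: float
    visual: Gaussian
    gap: Gaussian
    duration: Gaussian

    def log_joint(self, features: Features) -> float:
        # Assumption: features are independent given the hypothesis,
        # so the joint log-likelihood is a sum of per-feature terms.
        v, g, d = features
        return (math.log(self.prior) + self.visual.log_pdf(v)
                + self.gap.log_pdf(g) + self.duration.log_pdf(d))


def should_merge(features: Features, merge: ClassModel, split: ClassModel) -> bool:
    """Bayes decision rule: merge when the merge hypothesis has the
    larger posterior (equivalently, the larger log joint probability)."""
    return merge.log_joint(features) > split.log_joint(features)
```

A usage example with placeholder parameters, which in practice would be estimated from data rather than chosen by hand:

```python
merge_model = ClassModel(0.5, Gaussian(0.2, 0.1), Gaussian(1.0, 1.0), Gaussian(10.0, 5.0))
split_model = ClassModel(0.5, Gaussian(0.8, 0.1), Gaussian(5.0, 2.0), Gaussian(30.0, 10.0))
decision = should_merge((0.2, 1.0, 10.0), merge_model, split_model)
```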