Multimedia information systems include vast amounts of video, audio, animation, and graphics information. In order to manage all this information efficiently, it is necessary to organize the information into a usable format. Most structured videos, such as news and documentaries, include repeating shots of the same person or the same setting, which often convey information about the semantic structure of the video. In organizing video information, it is advantageous if this semantic structure is captured in a form which is meaningful to a user.
Prior attempts have been made to organize video. Database systems typically use attribute-based indexing, which involves manually segmenting video into meaningful semantic units. The multimedia information is abstracted, which reduces the scope for posing ad hoc queries to the multimedia database. See P. England et al., I/Browse: The Bellcore Video Library Toolkit, Storage and Retrieval for Still Image and Video Databases IV, SPIE, 1996. Attribute-based indexing, however, is extremely time consuming because a human operator must index the multimedia information manually.
Computer vision systems typically use an automatic, integrated feature extraction/object recognition subsystem which eliminates the manual video segmentation of attribute-based indexing. See M. M. Yeung et al., Video Browsing using Clustering and Scene Transitions on Compressed Sequences, Multimedia Computing and Networking, SPIE vol. 2417, pp. 399-413, 1995; H. J. Zhang et al., Automatic parsing of news video, International Conference on Multimedia Computing and Systems, pp. 45-54, 1994; and D. Swanberg et al., Knowledge guided parsing in video databases, Storage and Retrieval for Image and Video Databases, SPIE vol. 1908, pp. 13-25, 1993. These automatic methods attempt to capture the semantic structure of video; however, they are computationally expensive and difficult, extremely domain specific, and create hierarchies or indexes with only a small, fixed number of levels. For example, in the article by Zhang et al., known templates of anchor person shots are used to separate news stories. A shot in video refers to a contiguous recording of one or more raw frames of video depicting a continuous action in time and space. In the article by Swanberg et al., news videos are segmented or parsed using a known scene structure of news programs and models of anchor person shots. News videos have also been segmented by using the presence of a channel logo, the skin tones of the anchor person, and the scene structure of the news episode. See B. Gunsel et al., Video Indexing through Integration of Syntactic and Semantic Features, IEEE Multimedia Systems, pp. 90-95, 1996. Content-based indexing at the shot level using motion (without developing a high-level description of the video) has been described by F. Arman et al., Content-based browsing of video sequences, ACM Multimedia, pp. 97-103, August, 1994.
Domain-dependent approaches, however, cannot be used to capture the semantic structure in video for all possible scenarios, even for a very simple domain such as the news. For example, not every news story in a news broadcast begins with an anchor person shot, and it is difficult to define an anchor person image model that is generic to all broadcast stations.
A domain-independent approach that extracts story units for video browsing applications has been described by M. M. Yeung et al., Time-constrained Clustering for Segmentation of Video into Story Units, International Conference on Pattern Recognition, C, pp. 375-380, 1996. FIG. 1 shows a scene transition graph which provides a compact representation that serves as a summary of the story and may also provide useful information for automatic classification of video types. The scene transition graph is generated by detecting shots, identifying shots that have similar visual appearances, and detecting story units. However, the graph reveals only limited information about the semantic structure within a story unit. For example, an entire news broadcast is classified as one single story, making it difficult for users to browse through the news stories individually.
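By way of illustration only, the node-and-edge structure of a scene transition graph can be sketched as follows. The sketch assumes that shot detection and visual grouping have already been performed, so that each shot carries a cluster label; the function name and the label values are hypothetical, and the story unit detection step of Yeung et al. is not reproduced.

```python
def scene_transition_graph(shot_labels):
    """Build a simple scene transition graph from per-shot cluster labels.

    shot_labels lists, in temporal order, the cluster label assigned to each
    shot (hypothetical input; any grouping method could have produced it).
    Nodes are the distinct clusters; a directed edge (a, b) records that a
    shot in cluster a is immediately followed by a shot in cluster b.
    """
    nodes = sorted(set(shot_labels))
    edges = set()
    for current, following in zip(shot_labels, shot_labels[1:]):
        if current != following:
            edges.add((current, following))
    return nodes, sorted(edges)


# Example: an anchor shot (A) alternating with an interview shot (B),
# followed by a field report (C).
nodes, edges = scene_transition_graph(["A", "B", "A", "B", "C"])
```

Because the graph only records which clusters follow one another, it carries no information about how the shots inside one story unit relate to each other, which is the limitation noted above.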
Capturing the semantic structure in a video requires accurate shot detection and shot grouping. Most existing shot detection methods are based on preset thresholds or assumptions that reduce their applicability to a limited range of video types. For example, many existing methods make assumptions about how shots are connected in videos, ignoring how films/videos are produced and edited in reality. See P. Aigrain et al., The Automatic Real-Time Analysis of Film Editing and Transition Effects and its Applications, Computer and Graphics, Vol. 18, No. 1, pp. 93-103, 1994; A. Hampapur et al., Digital Video Segmentation, Proc. ACM Multimedia Conference, pp. 357-363, 1994; and J. Meng et al., Scene Change Detection in a MPEG Compressed Video Sequence, SPIE Vol. 2419, Digital Video Compression Algorithms and Technologies, pp. 14-25, 1995. These methods often assume that both the incoming and outgoing shots are static scenes with transitions which last for a period no longer than half a second. These assumptions do not provide sufficient data for modeling gradual shot transitions that are often present in films/videos. Existing shot detection methods also assume that time-series difference metrics are stationary, ignoring the fact that such metrics are highly correlated time signals. It is also assumed that the frame difference signal computed at each individual pixel can be modeled by a stationary, independent, identically distributed random variable which obeys a known probability distribution such as the Gaussian or Laplace distribution. See H. Zhang et al., Automatic Parsing of Full-Motion Video, ACM Multimedia Systems, 1, pp. 10-28, 1993. FIGS. 2A and 2B are histograms of typical inter-frame difference images that do not correspond to shot changes. FIG. 2A shows the histogram as the camera moves slowly left. FIG. 2B shows the histogram as the camera moves quickly right. The curve of FIG. 2A is shaped differently from the curve of FIG. 2B. Neither a Gaussian nor a Laplace distribution fits both of these curves well. A Gamma function fits the curve of FIG. 2A well, but not the curve of FIG. 2B.
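For illustration, the kind of preset-threshold detector discussed above, together with a histogram of an inter-frame difference image, can be sketched as follows. This is a minimal sketch under stated assumptions and not the method of any cited reference: frames are assumed to be small grayscale images given as lists of rows of integers in the range 0 to 255, and the function names, bin count, and threshold value are hypothetical.

```python
def difference_histogram(prev, curr, bins=16, max_value=255):
    """Histogram of absolute per-pixel differences between two frames."""
    bin_width = (max_value + 1) / bins
    hist = [0] * bins
    for row_prev, row_curr in zip(prev, curr):
        for p, c in zip(row_prev, row_curr):
            hist[int(abs(c - p) / bin_width)] += 1
    return hist


def mean_abs_difference(prev, curr):
    """Mean absolute per-pixel difference between two frames."""
    total = 0
    count = 0
    for row_prev, row_curr in zip(prev, curr):
        for p, c in zip(row_prev, row_curr):
            total += abs(c - p)
            count += 1
    return total / count


def detect_cuts(frames, threshold):
    """Indices i at which frame i is declared the start of a new shot.

    A single preset threshold is applied to every frame pair, which is
    exactly the assumption that limits such detectors: camera motion and
    gradual transitions change the shape of the difference distribution.
    """
    return [
        i
        for i in range(1, len(frames))
        if mean_abs_difference(frames[i - 1], frames[i]) > threshold
    ]


# Toy 2x2 frames: two nearly identical dark frames, then a bright frame.
dark = [[10, 10], [10, 10]]
dark_shifted = [[12, 11], [10, 9]]
bright = [[200, 200], [200, 200]]
cuts = detect_cuts([dark, dark_shifted, bright], threshold=30)
```

A threshold that separates these toy frames would not necessarily separate a fast pan from a true cut, since the two difference histograms of FIGS. 2A and 2B already differ in shape without any shot change.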
Additionally, many videos are converted from films. Video and film are played at different frame rates; thus, every other film frame is made a little bit longer to convert it to video. Consequently, the video frames are made up of two fields with totally different (although consecutive) pictures in them. As a result, the digitization produces duplicate video frames and almost zero inter-frame differences at five frame intervals. A similar problem occurs in animated videos such as cartoons, except that it produces almost zero inter-frame differences as often as every other frame.
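The periodic pattern described above can be checked for with a short sketch such as the following. It assumes that a sequence of scalar inter-frame difference values has already been computed; the function name and the tolerance `eps` are hypothetical, and the numbers in the example are made up for illustration.

```python
def near_zero_period(diffs, eps=0.05):
    """Return the constant spacing between near-zero differences, or None.

    diffs[i] is assumed to be the difference between frame i and frame i + 1.
    A value below eps is treated as a duplicated frame.
    """
    zero_positions = [i for i, d in enumerate(diffs) if d < eps]
    if len(zero_positions) < 2:
        return None
    gaps = {b - a for a, b in zip(zero_positions, zero_positions[1:])}
    return gaps.pop() if len(gaps) == 1 else None


# Film-originated video: a near-zero difference every five frames.
film_like = [5, 6, 0.01, 7, 5, 6, 8, 0.02, 5, 6, 7, 5, 0.0]
# Cartoon-like video: a near-zero difference every other frame.
cartoon_like = [0.0, 4, 0.0, 5, 0.0]
film_period = near_zero_period(film_like)
cartoon_period = near_zero_period(cartoon_like)
```

A detector that ignores this periodicity sees an artificial dip in its difference signal at each duplicated frame.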
Color histograms are typically used for grouping visually similar shots as described in M. J. Swain et al., Indexing via Color Histograms, Third International Conference on Computer Vision, pp. 390-393, 1990. However, a color histogram's ability to detect similarities when illumination variations are present is substantially affected by the color space used and by the color space quantization. Commonly used RGB and HSV color spaces are sensitive to illumination factors in varying degrees, and uniform quantization goes against the principles of human perception. See G. Wyszecki et al., Color Science: Concepts and Methods, Quantitative Data and Formulae, John Wiley & Sons, Inc., 1982.
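The sensitivity noted above can be illustrated with a minimal sketch of histogram comparison over a uniformly quantized RGB space. The sketch assumes pixels are given as (r, g, b) tuples of integers in the range 0 to 255 and that the number of quantization levels divides 256; the function names and the pixel values are hypothetical, and the similarity measure is a simple normalized histogram intersection.

```python
def color_histogram(pixels, levels=4):
    """Normalized histogram over uniformly quantized RGB (levels**3 bins)."""
    step = 256 // levels  # assumes levels divides 256
    hist = [0.0] * (levels ** 3)
    for r, g, b in pixels:
        index = (r // step) * levels * levels + (g // step) * levels + (b // step)
        hist[index] += 1
    n = len(pixels)
    return [h / n for h in hist]


def histogram_intersection(h1, h2):
    """Sum of bin-wise minima; 1.0 means identical normalized histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))


# The same two-pixel "scene" under two illumination levels
# (the second is the first with every channel scaled by 1.2).
scene = [(60, 60, 60), (200, 30, 30)]
brighter = [(72, 72, 72), (240, 36, 36)]

same = histogram_intersection(color_histogram(scene), color_histogram(scene))
shifted = histogram_intersection(color_histogram(scene), color_histogram(brighter))
```

Here the gray pixel crosses a uniform bin boundary when brightened, so the similarity of the scene to its brighter copy drops to half even though the content is unchanged.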
Thus, in practice, it is difficult to obtain a useful video organization based solely on automatic processing.
Accordingly, there is a need for a system which makes automatically extracted video structures more meaningful and useful.