1. Field of the Invention
The present invention relates to video management systems. More specifically, the invention is directed to a system for automatically processing a video sequence to extract metadata that provides an adequate visual representation of the video.
2. Description of the Related Technology
The management of video data is a critical information management problem. The value of video footage is fully realized only when the footage can be reused and repurposed in many different contexts. One of the key requirements for effectively accessing video from a large collection is the ability to retrieve video information by content. Content-based retrieval of video data demands a computer-readable representation of the video. This representation of the original video data is called metadata. The metadata includes a representation of the visual, audio and semantic content. In other words, a good representation of a video should effectively capture the look of the video, its sound and its meaning. An effective representation captures the essence of the video in as small a representation as possible. Such representations of the video can be stored in a database. A user trying to access video from a collection can query the database to perform a content-based search and locate the specific video asset of interest. FIG. 1 illustrates a block diagram of a video database system 100. Such a system is described in Designing Video Data Management Systems, Arun Hampapur, University of Michigan, 1995, which is herein incorporated by reference. Video data 102 is input into a Metadata Extraction module 104. The resultant metadata is stored in a database system 106 by use of an insertion interface 108.
The extraction (104) of metadata from the actual video data 102 is a very tedious process called video logging or manual annotation. Typically, this process requires labor averaging eight times the length of the video. What is desired is a system that automatically processes a video so as to extract, from a sequence of frames, metadata that provides a good visual representation of the video.
Some of the terminology used in the description of the invention will now be discussed. This terminology is explained with reference to a set of example images or frames shown in FIG. 2. Image one shows a brown building 120 surrounded by a green lawn 122 with a blue sky 124 as a background. Image two shows a brown car 126 on a green lawn 128 with a blue sky 130 as a background. Let us assume that these two frames are taken from adjacent shots in a video. These two frames can be compared based on several different sets of image properties, such as color properties, distribution of color over the image space, structural properties, and so forth. Since each image property represents only one aspect of the complete image, a system for generating an adequate representation by extracting orthogonal properties from the video is needed. The two images in FIG. 2 would appear similar in terms of their chromatic properties (both have approximately the same amounts of blue, green and brown colors) but would differ significantly in terms of their structural properties (the location of edges, how the edges are distributed and connected to each other, and so forth).
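By way of illustration only, and not as part of any prior disclosure, the following Python sketch shows one way a chromatic comparison (color histograms) and a structural comparison (edge maps) of two frames might be computed. The frame format (8-bit RGB arrays), the bin counts and the edge threshold are assumptions made purely for this example.

```python
# Illustrative sketch only: a chromatic (color histogram) comparison and a
# structural (edge map) comparison of two frames. The frame format
# (H x W x 3, 8-bit RGB), bin counts and edge threshold are assumed here.
import numpy as np

def color_histogram(frame, bins=8):
    # Global 3-D color histogram, normalized to sum to 1.
    hist, _ = np.histogramdd(
        frame.reshape(-1, 3).astype(float),
        bins=(bins, bins, bins),
        range=((0, 256), (0, 256), (0, 256)),
    )
    return hist / hist.sum()

def chromatic_difference(frame_a, frame_b, bins=8):
    # L1 distance between the two color histograms; near 0 for frames with
    # similar overall color content, such as the two images of FIG. 2.
    return float(np.abs(color_histogram(frame_a, bins) -
                        color_histogram(frame_b, bins)).sum())

def edge_map(frame, threshold=30.0):
    # Crude structural signature: gradient-magnitude edges of the gray image.
    gray = frame.mean(axis=2)
    gy, gx = np.gradient(gray)
    return np.hypot(gx, gy) > threshold

def structural_difference(frame_a, frame_b):
    # Fraction of pixels whose edge classification disagrees between frames;
    # large for the two images of FIG. 2 despite their similar colors.
    return float(np.mean(edge_map(frame_a) != edge_map(frame_b)))
```

In such a sketch, the building of image one and the car of image two would yield a small chromatic_difference but a large structural_difference, which is precisely why a measurement of only one kind is insufficient.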
An alternate scenario is one in which the two images differ in their chromatic properties but are similar in terms of their structural properties. An example of such a scenario occurs when there are two images of the same scene under different lighting conditions. This scenario also occurs when edit effects are introduced during the film or video production process, such as when a scene fades out to black or fades in from black.
Given any arbitrary video, the process used for generating an adequate visual representation of the video must be able to deal effectively with the situations outlined in the above discussion. The use of digital video editors in the production process is increasing the fraction of frames that are subjected to digital editing effects. Thus, an effective approach to generating adequate visual representations of videos, one that uses both chromatic and structural measurements from the video, is desired.
Several prior attempts at providing an adequate visual representation of the visual content of a video have been made: Arun Hampapur, Designing Video Data Management Systems, The University of Michigan, 1995; Behzad Shahraray, Method and apparatus for detecting abrupt and gradual scene changes in image sequences, AT&T Corp., 32 Avenue of the Americas, New York, N.Y. 10013-2412, 1994, European Patent Application No. 0 660 327 A2; Hong Jiang Zhang, Stephen W. Smoliar and Jian Hu Wu, A system for locating automatically video segment boundaries and for extracting key-frames, Institute of Systems Science, Kent Ridge, Singapore 0511, 1995, European Patent Application No. 0 690 413 A2; and Akio Nagasaka and Yuzuru Tanaka, “Automatic Video Indexing and Full-Video Search for Object Appearances”, Proceedings of the 2nd Working Conference on Visual Database Systems, pp. 119-133, 1991. Most existing techniques have focused on detecting abrupt and gradual scene transitions in video. However, the more essential problem to be solved is deriving an adequate visual representation of the visual content of the video.
Most of the existing scene transition detection techniques, including those of Shahraray and Zhang et al., use the following measurements for detecting gradual and abrupt scene transitions: 1) intensity-based difference measurements, wherein the difference between two frames of the video separated by some time interval “T” is extracted; typical difference measures include pixel difference measures, gray-level global histogram measures, local pixel and histogram difference measures, color histogram measures, and so forth; and 2) thresholding of the difference measurements, wherein the difference measures are thresholded using either a single threshold or multiple thresholds.
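Purely for illustration, the two steps just described might be sketched in Python as follows. The gray-level frame arrays, the interval “T” and the single threshold value are assumptions made for this example and are not taken from the cited techniques.

```python
# Illustrative sketch only: (1) an intensity-based difference between frames
# separated by an interval T, and (2) a single-threshold decision on that
# difference. Gray-level frame arrays, the interval and the threshold value
# are assumptions made purely for the example.
import numpy as np

def pixel_difference(frame_a, frame_b):
    # Mean absolute gray-level difference between two frames.
    return float(np.mean(np.abs(frame_a.astype(float) - frame_b.astype(float))))

def histogram_difference(frame_a, frame_b, bins=64):
    # L1 distance between normalized gray-level histograms.
    ha, _ = np.histogram(frame_a, bins=bins, range=(0, 256))
    hb, _ = np.histogram(frame_b, bins=bins, range=(0, 256))
    return float(np.abs(ha / ha.sum() - hb / hb.sum()).sum())

def detect_transitions(frames, interval_t=1, threshold=20.0):
    # Declare a scene transition wherever the difference between a frame and
    # the frame T steps earlier exceeds a single fixed threshold -- the
    # decision criterion on which such techniques critically depend.
    transitions = []
    for i in range(interval_t, len(frames)):
        if pixel_difference(frames[i - interval_t], frames[i]) > threshold:
            transitions.append(i)
    return transitions
```

As discussed below, the behavior of such a detector hinges entirely on the choice of the threshold value: lowering it oversamples the video, while raising it loses information.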
However, to generate an adequate visual representation of the visual content of the video, a system is needed whose efficacy is not critically dependent on the threshold or decision criterion used to declare a scene break or scene transition. Using existing techniques, a low value of the threshold would result in an oversampled representation of the video, whereas a higher value would result in a loss of information. What is needed is a system wherein the choice of the decision criterion is a non-critical factor.