1. Field of the Invention
The present invention relates to video management systems. More specifically, the invention is directed to a system for automatically processing a video sequence to extract metadata that provides an adequate visual representation of the video.
2. Description of the Related Technology
The management of video data is a critical information management problem. The value of video footage can be effectively utilized only when it can be reused and repurposed in many different contexts. One of the key requirements to effectively access video from a large collection is the ability to retrieve video information by content. Content-based retrieval of video data demands a computer-readable representation of video. This representation of the original video data is called metadata. The metadata includes a representation of the visual, audio and semantic content. In other words, a good representation of a video should effectively capture the look of the video, its sound and its meaning. An effective representation of the video captures the essence of the video in as small a representation as possible. Such representations of the video can be stored in a database. A user trying to access video from a collection can query the database to perform a content-based search of the video collection to locate the specific video asset of interest. FIG. 1 illustrates a block diagram of a video database system 100. Such a system is described in Designing Video Data Management Systems, Arun Hampapur, University of Michigan, 1995, which is herein incorporated by reference. Video data 102 is input into a Metadata Extraction module 104. The resultant metadata is stored in a database system 106 by use of an insertion interface 108.
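The data flow of FIG. 1 (video data, metadata extraction, insertion interface, database) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the referenced system: the table name `frame_metadata`, the function names, the use of SQLite, and the choice of mean intensity as the only extracted property.

```python
import sqlite3

def extract_metadata(video_id, frames):
    """Toy extraction step: one row per frame holding its mean intensity."""
    rows = []
    for index, frame in enumerate(frames):
        pixels = [p for row in frame for p in row]
        rows.append((video_id, index, sum(pixels) / len(pixels)))
    return rows

def insert_metadata(conn, rows):
    """Hypothetical insertion interface: store extracted rows in the database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS frame_metadata "
        "(video_id TEXT, frame_index INTEGER, mean_intensity REAL)"
    )
    conn.executemany("INSERT INTO frame_metadata VALUES (?, ?, ?)", rows)
    conn.commit()

def find_bright_frames(conn, minimum):
    """Content-based query: frames whose mean intensity is at least `minimum`."""
    cursor = conn.execute(
        "SELECT video_id, frame_index FROM frame_metadata "
        "WHERE mean_intensity >= ? ORDER BY video_id, frame_index",
        (minimum,),
    )
    return cursor.fetchall()

# Two tiny grayscale frames: one dark, one bright.
conn = sqlite3.connect(":memory:")
insert_metadata(conn, extract_metadata("clip", [[[10, 20]], [[200, 220]]]))
bright = find_bright_frames(conn, 128)
```

The query runs against the stored metadata only, which is the point of content-based retrieval: the original video data does not need to be rescanned to answer it.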
The extraction (104) of metadata from the actual video data 102 is a very tedious process called video logging or manual annotation. Typically, this process requires, on average, labor amounting to eight times the length of the video. What is desired is a system that automatically processes a video sequence of frames so as to extract metadata that provides a good visual representation of the video.
Some of the terminology used in the description of the invention will now be discussed. This terminology is explained with reference to a set of example images or frames shown in FIG. 2. Image one shows a brown building 120 surrounded by a green lawn 122 with a blue sky 124 as a background. Image two shows a brown car 126 on a green lawn 128 with a blue sky 130 as a background. Let us assume that these two frames are taken from adjacent shots in a video. These two frames can be compared based on several different sets of image properties, such as color properties, distribution of color over the image space, structural properties, and so forth. Since each image property represents only one aspect of the complete image, a system for generating an adequate representation by extracting orthogonal properties from the video is needed. The two images in FIG. 2 would appear similar in terms of their chromatic properties (both have approximately the same amount of blue, green and brown colors) but would differ significantly in terms of their structural properties (the location of edges, how the edges are distributed and connected to each other, and so forth).
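The distinction between chromatic and structural properties can be illustrated with a small sketch. The two 4x4 frames below are hypothetical stand-ins for the images of FIG. 2, with pixels given as color labels ("B" blue, "G" green, "N" brown). The particular measures used (a histogram distance and an edge-map overlap) are assumptions chosen for illustration, not measurements prescribed by the text.

```python
from collections import Counter

def color_histogram(frame):
    """Fraction of pixels carrying each color label."""
    pixels = [p for row in frame for p in row]
    return {color: n / len(pixels) for color, n in Counter(pixels).items()}

def chromatic_difference(a, b):
    """Half the L1 distance between color histograms (0 = same mix, 1 = disjoint)."""
    ha, hb = color_histogram(a), color_histogram(b)
    return 0.5 * sum(abs(ha.get(c, 0.0) - hb.get(c, 0.0)) for c in set(ha) | set(hb))

def edge_map(frame):
    """Positions whose right or lower neighbor has a different color."""
    rows, cols = len(frame), len(frame[0])
    edges = set()
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols and frame[r][c] != frame[r][c + 1]:
                edges.add((r, c))
            if r + 1 < rows and frame[r][c] != frame[r + 1][c]:
                edges.add((r, c))
    return edges

def structural_difference(a, b):
    """One minus the Jaccard overlap of the two edge maps."""
    ea, eb = edge_map(a), edge_map(b)
    union = ea | eb
    return 0.0 if not union else 1.0 - len(ea & eb) / len(union)

# Same amounts of blue, green and brown, arranged differently.
building = ["BBBB", "BNNB", "GNNG", "GGGG"]
car = ["BBBB", "BBGG", "NNNN", "GGGG"]

chroma = chromatic_difference(building, car)   # identical color mix
structure = structural_difference(building, car)  # edge layouts differ
```

With these toy frames the chromatic difference is exactly zero while the structural difference is nonzero, which is the situation described for FIG. 2.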
An alternate scenario is where the two images differ in their chromatic properties but are similar in terms of their structural properties. An example of such a scenario occurs when there are two images of the same scene under different lighting conditions. This scenario also occurs when edit effects are introduced during the film or video production process like when a scene fades out to black or fades in from black.
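The fade scenario can be sketched in the same spirit. The example below assumes grayscale frames stored as lists of integer intensities and models a partial fade toward black by halving every pixel. The helper names and the relative edge criterion are hypothetical choices made for illustration.

```python
def mean_intensity(frame):
    """Average gray level over all pixels."""
    pixels = [p for row in frame for p in row]
    return sum(pixels) / len(pixels)

def relative_edges(frame, fraction=0.5):
    """Horizontal neighbor pairs whose intensity step is at least
    `fraction` of the largest step found in the frame."""
    steps = {
        (r, c): abs(row[c + 1] - row[c])
        for r, row in enumerate(frame)
        for c in range(len(row) - 1)
    }
    largest = max(steps.values(), default=0)
    if largest == 0:
        return set()  # a uniform frame (for example, fully black) has no edges
    return {pos for pos, step in steps.items() if step >= fraction * largest}

scene = [[0, 0, 200, 200], [0, 0, 200, 200], [100, 100, 100, 100]]
faded = [[p // 2 for p in row] for row in scene]  # halfway through a fade-out

brightness_changed = mean_intensity(scene) != mean_intensity(faded)
edges_preserved = relative_edges(scene) == relative_edges(faded)
```

Because the edge criterion is relative to the strongest step in each frame, the edge locations survive the change in brightness. Once the frame is completely black no edges remain, so this sketch only covers the intermediate frames of a fade.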
Given any arbitrary video, the process used for generating an adequate visual representation of the video must be able to effectively deal with the situations outlined in the above discussion. The use of digital video editors in the production process is increasing the fraction of frames which are subjected to digital editing effects. Thus an effective approach to generating adequate visual representations of videos is desired that uses both chromatic and structural measurements from the video.
Several prior attempts at providing an adequate visual representation of the visual content of a video have been made: Arun Hampapur, Designing Video Data Management Systems, The University of Michigan, 1995; Behzad Shahraray, Method and apparatus for detecting abrupt and gradual scene changes in image sequences, AT&T Corp, 32 Avenue of the Americas, New York, N.Y. 10013-2412, 1994, European Patent Application number 066327 A2; Hong Jiang Zhang, Stephen W Smoliar and Jian Hu Wu, A system for locating automatically video segment boundaries and for extracting key-frames, Institute of System Science, Kent Ridge, Singapore 0511, 1995, European Patent Application number 0 690413 A2; and Akio Nagasaka and Yuzuru Tanaka, "Automatic Video Indexing and Full-Video Search for Object Appearances", Proceedings of the 2nd Working Conference on Visual Database Systems, pp. 119-133, 1991. Most existing techniques have focused on detecting abrupt and gradual scene transitions in video. However, the more essential problem to be solved is deriving an adequate visual representation of the visual content of the video.
Most of the existing scene transition detection techniques, including Shahraray and Zhang et al., use the following measurements for gradual and abrupt scene transitions: 1) Intensity based difference measurements wherein the difference between two frames from the video which are separated by some time interval "T" is extracted. Typically, the difference measures include pixel difference measures, gray level global histogram measures, local pixel and histogram difference measures, color histogram measures, and so forth. 2) Thresholding of difference measurements wherein the difference measures are thresholded using either a single threshold or multiple thresholds.
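The two measurements listed above can be sketched as follows. This is a generic illustration and not the specific method of Shahraray or Zhang et al.: it assumes equally sized grayscale frames with levels 0 to 255, uses a coarse global gray-level histogram as the difference measure, and applies a single threshold. The function names, bin count and threshold value are hypothetical.

```python
def gray_histogram(frame, bins=4, levels=256):
    """Global gray-level histogram with `bins` equal-width bins."""
    hist = [0] * bins
    for row in frame:
        for p in row:
            hist[p * bins // levels] += 1
    return hist

def histogram_difference(a, b, bins=4):
    """Normalized histogram difference in the range 0..1 (equal frame sizes assumed)."""
    ha, hb = gray_histogram(a, bins), gray_histogram(b, bins)
    return sum(abs(x - y) for x, y in zip(ha, hb)) / (2 * sum(ha))

def detect_transitions(frames, interval=1, threshold=0.5):
    """Indices i at which frame i differs from frame i - interval
    by more than the single threshold."""
    return [
        i
        for i in range(interval, len(frames))
        if histogram_difference(frames[i - interval], frames[i]) > threshold
    ]

dark = [[10, 20], [30, 40]]
bright = [[200, 210], [220, 230]]
transitions = detect_transitions([dark, dark, bright, bright])
```

The `interval` parameter plays the role of the time interval "T": a larger value compares frames that are further apart, which tends to make gradual transitions easier to see in the difference measure.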
However, to generate an adequate visual representation of the visual content of the video, a system is needed whose efficacy, unlike that of the existing techniques, is not critically dependent on the threshold or decision criteria used to declare a scene break or scene transition. Using existing techniques, a low value of the threshold would result in an oversampled representation of the video, whereas a higher value would result in the loss of information. What is needed is a system wherein the choice of the decision criteria is a non-critical factor.
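The sensitivity to the threshold can be made concrete with a small sketch. The difference scores below are invented for illustration only; entry i is assumed to be the difference between frame i - 1 and frame i of some video.

```python
def breaks_at(differences, threshold):
    """Frame indices declared as scene breaks under a single threshold."""
    return [i for i, d in enumerate(differences, start=1) if d > threshold]

# Hypothetical scores: mostly noise, one strong change (0.60)
# and one weaker change (0.30).
differences = [0.05, 0.08, 0.60, 0.07, 0.30, 0.06]

low = breaks_at(differences, 0.02)     # every frame is declared a break
middle = breaks_at(differences, 0.20)  # both changes are kept
high = breaks_at(differences, 0.50)    # the weaker change is lost
```

With the low threshold every frame is selected, which is the oversampled representation described above. With the high threshold the weaker change is dropped, which is the loss of information.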
One embodiment of the present invention includes a computer-based system for identifying keyframes or a visual representation of a video by use of a two-stage measurement process. Frames from a user-selected video segment or sequence are processed to identify the keyframes. The first stage preferably includes a chromatic difference measurement to identify a potential set of keyframes. To be considered a potential keyframe, a frame must yield a measurement result that exceeds a user-selectable chromatic threshold. The potential set of keyframes is then passed to the second stage, which preferably includes a structural difference measurement. If the result of the structural difference measurement then exceeds a user-selectable structural threshold, the current frame is identified as a keyframe. The two-stage process is then repeated to identify additional keyframes until the end of the video. If the measurement for a particular frame fails to exceed either the first or the second threshold, the next frame, after a user-selectable time delta, is processed.
The first stage is preferably computationally cheaper than the second stage. The second stage is more discriminatory since it preferably operates on a smaller set of frames. The keyframing system is extensible to additional stages or measurements as necessary.
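A minimal sketch of the two-stage process follows. Several details are assumptions because the summary leaves them open: each frame is compared against the most recent keyframe, the first frame seeds the keyframe list, and the time delta is expressed as a frame step. The toy frame format (a brightness value paired with a set of edge identifiers) and the two measurement functions are hypothetical placeholders for real chromatic and structural measurements.

```python
def select_keyframes(frames, chromatic, structural,
                     chromatic_threshold, structural_threshold, step=1):
    """Return indices of keyframes chosen by a two-stage test."""
    if not frames:
        return []
    keyframes = [0]  # assumption: the first frame is taken as a keyframe
    for i in range(step, len(frames), step):
        reference = frames[keyframes[-1]]  # assumption: compare to last keyframe
        # Stage 1: cheaper chromatic test selects potential keyframes.
        if chromatic(reference, frames[i]) <= chromatic_threshold:
            continue
        # Stage 2: structural test runs only on frames that passed stage 1.
        if structural(reference, frames[i]) > structural_threshold:
            keyframes.append(i)
    return keyframes

def brightness_difference(a, b):
    """Toy chromatic measure on (brightness, edge_set) frames."""
    return abs(a[0] - b[0])

def edge_set_difference(a, b):
    """Toy structural measure: one minus the Jaccard overlap of edge sets."""
    union = a[1] | b[1]
    return 0.0 if not union else 1.0 - len(a[1] & b[1]) / len(union)

frames = [
    (100, frozenset({1, 2, 3})),  # opening frame
    (102, frozenset({1, 2, 3})),  # nearly identical
    (40, frozenset({1, 2, 3})),   # same scene, much darker (fade or lighting)
    (30, frozenset({7, 8, 9})),   # new scene
    (32, frozenset({7, 8, 9})),   # nearly identical to the new scene
]
selected = select_keyframes(frames, brightness_difference, edge_set_difference,
                            chromatic_threshold=10, structural_threshold=0.5)
```

In this toy run the darkened frame passes the chromatic stage but is rejected by the structural stage, so a lighting change alone does not produce a keyframe. The structural measure is evaluated for only two of the four candidate frames, which reflects why the second stage can afford to be the more expensive one.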
In one aspect of the invention, there is a method for detecting scene changes in a digital video data stream displayed upon a monitor coupled to a computer executing an operating system including a software display control program operative to control display of all information displayed upon the monitor, said method comprising the steps of (a) providing a scene detection software program executed by the computer, wherein said scene detection software program and said software display control program are separate programs, said scene detection software program performing the following steps: (b) retrieving information for each first pixel in a first frame of the digital video data stream from said software display control program; (c) retrieving information for each second pixel in a second frame of the digital video data stream from said software display control program; and (d) detecting a scene change if the second pixel information differs from the first pixel information by more than a predetermined amount.
In another aspect of the invention, there is a method for detecting scene changes in a digital video data stream displayed upon a monitor coupled to a computer executing an operating system including a software display control program operative to control display of all information displayed upon the monitor, said method comprising the steps of (a) providing a scene detection software program executed by the computer, wherein said scene detection software program and said software display control program are separate programs, said scene detection software program performing the following steps: (b) retrieving information for each first pixel in a first frame of the digital video data stream from said software display control program; (c) retrieving information for each second pixel in a second frame of the digital video data stream from said software display control program; (d) detecting a scene change if the second pixel information differs from the first pixel information by more than a predetermined amount; (e) recording an index representative of where the scene change occurred in the digital video data stream; and (f) recording a representative frame of a scene bounded by the scene change.
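Steps (b) through (f) of this aspect can be illustrated with a short sketch. It leaves out the retrieval of pixel information from the software display control program and simply assumes that frames are already available as equally sized lists of gray levels. The mean absolute pixel difference, the choice of the first frame after the change as the representative frame, and all names are assumptions made for illustration.

```python
def mean_pixel_difference(first, second):
    """Mean absolute difference between corresponding pixels of two frames."""
    pairs = [(a, b) for row_a, row_b in zip(first, second)
             for a, b in zip(row_a, row_b)]
    return sum(abs(a - b) for a, b in pairs) / len(pairs)

def log_scene_changes(frames, predetermined_amount):
    """Record an index and a representative frame for each detected change."""
    log = []
    for index in range(1, len(frames)):
        difference = mean_pixel_difference(frames[index - 1], frames[index])
        if difference > predetermined_amount:
            log.append({
                "index": index,  # where the change occurred in the stream
                # assumption: the first frame after the change represents the scene
                "representative_frame": frames[index],
            })
    return log

stream = [[[10, 10]], [[12, 11]], [[200, 190]]]
changes = log_scene_changes(stream, predetermined_amount=50)
```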
In another aspect of the invention, there is a method for detecting scene changes in a digital video data stream displayed upon a monitor coupled to a computer executing an operating system including a software display control program operative to control display of all information displayed upon the monitor, said method comprising the steps of (a) providing a scene detection software program executed by the computer, wherein said scene detection software program and said software display control program are separate programs, said scene detection software program performing the following steps: (b) retrieving information for a first frame of the digital video data stream from said software display control program; (c) retrieving information for a second frame of the digital video data stream from said software display control program; and (d) detecting a scene change between the first frame and the second frame using the first frame information and the second frame information.
In yet another aspect of the invention, there is a method for detecting scene changes in a digital video data stream displayed upon a monitor coupled to a computer executing an operating system including a software display control program operative to control display of all information displayed upon the monitor, said method comprising the steps of (a) providing a scene detection software program executed by the computer, wherein said scene detection software program and said software display control program are separate programs, said scene detection software program performing the following steps: (b) retrieving digital video data stream information from said software display control program; and (c) detecting a scene change in said digital video data stream using said information.