1. Field of the Invention
The present invention relates to the field of processing video for the purpose of automatically classifying video according to content. Specifically, the present invention pertains to the field of classifying video frames according to similarity to predefined video image classes as measured by image class statistical models derived from training frames.
2. Discussion of the Related Art
Conventional systems have been developed which classify pre-segmented video clips using features specifically selected for the classes, so that arbitrary class selection is not possible. Other conventional systems discriminate between news and sports video clips, again pre-segmented, using motion features alone. Some conventional systems have identified video frames that are similar to a given video frame image. Other conventional systems use spatial template matching and color histograms to segment video; however, the templates must be created by hand. Much work has been done on segmenting video using compressed-domain features such as block-transform coefficients. Though these compressed-domain approaches are effective, the block-transform domain, like color histograms, is unable to capture significant image features.
Hidden Markov models have been used to segment video, but classification was not considered. Color histogram features and motion cues have been used for video segmentation using hidden Markov models. Markov-like finite state machines have been used on principal components of the frame pixel intensities, but not on transform features. Hidden Markov model video segmentation using motion features has been attempted, but it does not use image features directly, nor does it use them for similarity matching. A system for matching video sequences using the temporal correlation of extremely reduced frame image representations has been attempted. While this conventional approach can find repeated instances of video shots, for example “instant replays” of sporting events, it is unclear how well it generalizes to video that is not substantially similar.
Individual image frames have been analyzed with a combination of color histogram and pixel-domain template matching. Color histograms, as well as motion and texture features, have been used to segment video. Indexing video in the compressed domain, using the sub-block and motion information already present in MPEG-encoded video, has been studied. Video shot matching by comparing time sequences of rank-based frame “fingerprints” has been experimented with. Many conventional image retrieval systems used statistics of block-transform coefficients.
The exception to block transforms seems to be wavelet approaches, which typically analyze an entire image using a wavelet basis. Quantizing and truncating higher-order coefficients reduces the dimensionality, while the similarity distance measure is just a count of bitwise similarity. This approach apparently has not been used with more traditional transforms such as the discrete cosine transform or Hadamard transform, nor has it been applied to video. Neural-network and decision-tree approaches have been used to classify images, but in the spatial (pixel intensity) domain. A radial projection of fast Fourier transform coefficients has been used as a signature for image retrieval.
Automatic classification of video is useful for a wide variety of applications, for example, automatic segmentation and content-based retrieval. Applications using automatic classification can support users in browsing and retrieving digitized video. Other applications include identifying facial close-up video frames before running a computationally expensive face recognizer. Conventional approaches to video classification and segmentation use features such as color histograms or motion estimates, which are less powerful than the features employed according to the methods of the present invention. Unlike many similarity measures based on color histograms, this approach models image composition features and works on black-and-white as well as color sources. According to the present invention, Gaussian image class statistical models capture the characteristic composition and shape of an image class, while also modeling the variation due to motion or lighting differences. Conventional approaches must also perform segmentation prior to classification. In contrast, according to the present invention, classification and segmentation are performed simultaneously using the same features for both. According to the present invention, automatic classification and retrieval of video sequences according to a set of predefined video image classes is achieved.
According to an aspect of the present invention, a feature set used for classification of video sequences is either determined from the training images, determined from one or more videos, or is predetermined. The feature set used for classification may be predetermined by truncation to be the coefficients of the lowest frequency components of the transform matrices. Alternatively, the feature set used for classification may be determined by principal component analysis, by selecting the coefficients having the highest average magnitudes, by selecting the coefficients having the highest average variances, or by linear discriminant analysis of the transform coefficients from the training images or from the frames of one or more videos. Preferably, the same feature set is used for all video image class statistical models in order for the same feature vector of each frame to be retrieved and analyzed for all video image class statistical models.
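Two of the feature-set choices named above, truncation to the lowest-frequency coefficients and selection of the highest-variance coefficients, can be illustrated with a minimal Python sketch. The function names, the use of NumPy, and the assumption that each frame's square transform matrix has been flattened row by row into one vector are illustrative choices, not details taken from this description.

```python
import numpy as np

def lowest_frequency_indices(size, k):
    """Truncation: flat indices of the top-left k x k block of a
    size x size coefficient matrix (row-major flattening assumed)."""
    rows, cols = np.meshgrid(np.arange(k), np.arange(k), indexing="ij")
    return (rows * size + cols).ravel()

def highest_variance_indices(coeffs, n_features):
    """Data-driven selection: flat indices of the n_features coefficients
    whose variance across the supplied frames is largest.

    coeffs has shape (n_frames, n_coefficients).
    """
    variances = coeffs.var(axis=0)
    top = np.argsort(variances)[::-1][:n_features]
    return np.sort(top)

# Hypothetical usage: the same index set is applied to every frame so that
# one feature vector per frame can be scored against every class model.
coeffs = np.array([[0.0, 0.0, 0.0, 0.0],
                   [0.0, 10.0, 1.0, 0.0],
                   [0.0, -10.0, -1.0, 0.0]])
selected = highest_variance_indices(coeffs, 2)
feature_vectors = coeffs[:, selected]
```

Either function returns a fixed set of indices, which is what allows a single feature vector per frame to be reused across all video image class statistical models.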
According to another aspect of the present invention, a Gaussian model is trained by computing a mean feature vector and a variance feature vector from the feature vectors extracted from training images. Multiple Gaussian models corresponding to multiple video image classes are defined using multiple classes of training images. To construct a video image class statistical model, training images are transformed using a discrete cosine transform or a Hadamard transform. For accurate modeling with reduced computation, the dimensionality of the coefficient matrices is reduced either by truncation or principal component analysis, resulting in a feature set. Statistical model parameters for a particular video image class are calculated from feature vectors extracted from transform matrices of the training images. Class transition probabilities associated with each of the multiple classes allow hidden Markov models to factor information about video image class sequences into the classification process according to the present invention. A user may easily pre-define an arbitrary set of video classes, and train a system to segment and classify unknown video according to these classes.
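The training steps described above (transform, dimensionality reduction by truncation, then mean and variance estimation) can be sketched as follows. This is a minimal illustration under stated assumptions: training images are square grayscale NumPy arrays of equal size, the transform is an orthonormal two-dimensional DCT-II built directly from its definition, and the small variance floor is an added safeguard rather than something this description specifies. All function names are hypothetical.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix of shape (n, n)."""
    k = np.arange(n).reshape(-1, 1)   # frequency index (rows)
    i = np.arange(n).reshape(1, -1)   # sample index (columns)
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)        # DC row has its own scale factor
    return m

def extract_features(image, k):
    """2-D DCT of a square image, truncated to the k x k
    lowest-frequency coefficients and flattened to a vector."""
    d = dct_matrix(image.shape[0])
    coeffs = d @ image @ d.T
    return coeffs[:k, :k].ravel()

def train_gaussian(images, k, variance_floor=1e-6):
    """Mean and variance feature vectors for one video image class.

    variance_floor is an assumption added here to avoid zero variances.
    """
    feats = np.stack([extract_features(img, k) for img in images])
    return feats.mean(axis=0), feats.var(axis=0) + variance_floor

# Hypothetical usage with random stand-in "training images".
rng = np.random.default_rng(0)
training_images = [rng.random((8, 8)) for _ in range(5)]
mean_vec, var_vec = train_gaussian(training_images, k=3)
```

Repeating `train_gaussian` once per class of training images would yield one (mean, variance) pair per video image class.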
According to yet another aspect of the present invention, video frame classification of a video frame into one of several video image classes is performed by analyzing the feature vector corresponding to the video frame with each image class statistical model. Using a Gaussian model, the probability that the image class statistical model generates the feature vector corresponding to the video frame is computed. If several video image class statistical models are used for analyzing the feature vector corresponding to a video frame, the video frame is classified into the image class having the image class statistical model resulting in the highest probability of generating the feature vector corresponding to the video frame. While more computationally demanding, the classification using hidden Markov models has the advantage of providing superior classification and segmentation, since class transition probabilities explicitly model class durations and sequences. If only a single video image class statistical model is used for analyzing the feature vector corresponding to the video frame, the difference between the feature vector and the mean feature vector is computed. This difference can be used as a distance measure to judge how similar the test data is to the training frames. The magnitude of the difference is compared to a predetermined multiple of the standard deviation of the image class statistical model. The frame may be classified as similar or non-similar based upon the result of the comparison.
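The three decision procedures described above can be sketched in Python. The sketch assumes diagonal-covariance Gaussians (one variance per feature), one hidden Markov model state per video image class, and a root-mean-square normalized distance as the reading of "magnitude of the difference"; these are illustrative assumptions, as are all function and parameter names.

```python
import numpy as np

def log_likelihood(x, mean, var):
    """Log probability of feature vector x under a diagonal Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def classify(x, models):
    """Pick the class whose model most probably generated x.
    models maps a class name to a (mean, variance) pair."""
    return max(models, key=lambda name: log_likelihood(x, *models[name]))

def is_similar(x, mean, var, multiple=3.0):
    """Single-model test: distance from the mean in standard-deviation
    units, compared against a chosen multiple (assumed interpretation)."""
    distance = np.sqrt(np.mean((x - mean) ** 2 / var))
    return bool(distance <= multiple)

def viterbi(log_emissions, log_trans, log_prior):
    """Most likely class sequence for a run of frames.

    log_emissions: (T, C) per-frame, per-class log likelihoods.
    log_trans[i, j]: log probability of moving from class i to class j.
    log_prior: (C,) log probability of each class at the first frame.
    """
    n_frames, n_classes = log_emissions.shape
    score = log_prior + log_emissions[0]
    back = np.zeros((n_frames, n_classes), dtype=int)
    for t in range(1, n_frames):
        cand = score[:, None] + log_trans      # cand[i, j]: arrive in j from i
        back[t] = np.argmax(cand, axis=0)
        score = cand[back[t], np.arange(n_classes)] + log_emissions[t]
    path = [int(np.argmax(score))]
    for t in range(n_frames - 1, 0, -1):       # trace back-pointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Hypothetical usage: a single ambiguous frame is absorbed into the
# surrounding class when transitions between classes are unlikely.
models = {"a": (np.zeros(2), np.ones(2)), "b": (np.full(2, 5.0), np.ones(2))}
label = classify(np.array([0.1, -0.2]), models)
emis = np.array([[0.0, -5.0], [0.0, -5.0], [-1.0, 0.0], [0.0, -5.0], [0.0, -5.0]])
trans = np.log(np.array([[0.95, 0.05], [0.05, 0.95]]))
sequence = viterbi(emis, trans, np.log(np.array([0.5, 0.5])))
```

In the usage example, frame-by-frame maximum-likelihood classification would label the third frame differently from its neighbors, while the Viterbi path keeps it in the same class, which illustrates how transition probabilities influence segmentation.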
According to still another aspect of the present invention, the logarithm of the probability of the feature vector being generated by the image class statistical model is computed and displayed as a graphical indication of similarity of frames to the image class.
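As a rough illustration of such a display, the following Python sketch computes a per-frame log probability under a diagonal Gaussian model and renders it as text bars. The diagonal-covariance form, the min-max scaling, and the text rendering are assumptions made for this example; the description does not specify how the graphical indication is drawn.

```python
import numpy as np

def frame_log_probs(features, mean, var):
    """Log probability of each frame's feature vector under one
    diagonal Gaussian class model. features has shape (n_frames, d)."""
    features = np.asarray(features, dtype=float)
    return -0.5 * np.sum(
        np.log(2 * np.pi * var) + (features - mean) ** 2 / var, axis=1
    )

def text_bars(log_probs, width=20):
    """Scale log probabilities to [0, width] and draw one bar per frame;
    longer bars indicate frames more similar to the class."""
    lo, hi = log_probs.min(), log_probs.max()
    span = hi - lo if hi > lo else 1.0
    return ["#" * int(round(width * (lp - lo) / span)) for lp in log_probs]

# Hypothetical usage: the first frame sits at the class mean,
# the second lies far from it.
log_probs = frame_log_probs([[0.0, 0.0], [3.0, 3.0]], np.zeros(2), np.ones(2))
for frame_number, bar in enumerate(text_bars(log_probs)):
    print(f"frame {frame_number}: {bar}")
```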
These and other features and advantages of the present invention are more fully described in the Detailed Description of the Invention with reference to the Figures.