1. Field of the Invention
The present invention relates to the field of interactively processing video for the purpose of automatically locating specific content. Specifically, the present invention pertains to the field of interactively defining training images and displaying similarity search results.
2. Discussion of the Related Art
Most state-of-the-art systems for video retrieval first segment video into shots, and then create a single keyframe or multiple keyframes for each shot. Video segment retrieval then reduces to image retrieval based on keyframes. More complex conventional systems average color and temporal variation across the query segment, but then perform retrieval based on keyframes in the segmented video. Conventional systems have been designed to find video sequences that exactly match the query, for example instant replays.
There has been much work on still image retrieval by similarity. Retrieval based upon color histogram similarity has been described. Several image similarity measures have been based on wavelet decompositions. Quantizing and truncating higher-order coefficients reduces the dimensionality, while the similarity distance measure is just a count of bitwise similarity. However, this approach has apparently not been used with the discrete cosine transform or the Hadamard transform. All known image retrieval-by-similarity systems require a single image as a query and do not naturally generalize to image groups or classes. Although there has been much work on video queries, much of the literature focuses on query formalisms while presupposing an existing analysis or annotation.
Due to the high cost of video processing, little work has been done on rapid similarity measures. Analysis of individual image frames with a combination of color histogram and pixel-domain template matching has been attempted, though the templates must be hand-tailored to the application and so do not generalize. Another distance metric is based on statistical properties of the frames, such as the mean and standard deviation of gray levels in regions of the frames.
Other conventional approaches include queries by sketch, perhaps enhanced with motion attributes. As far as using actual video clips as queries, the few reports in the literature include a system in which video "shots" are represented by still images for both query and retrieval, and a system in which video segments are characterized by average color and temporal variation of color histograms. In a similar approach, shots are found automatically and then compared using a color histogram similarity measure. Matching video sequences using the temporal correlation of extremely reduced frame image representations has been attempted. While this can find repeated instances of video shots, for example "instant replays" of sporting events, it is not clear how well it generalizes to video that is not substantially similar. Video similarity has been computed as the Euclidean distance between short windows of frame distances determined by distance of image eigen projections. This appears to find similar regions in the test video, but may not generalize well as it depends on the video used to calculate the eigen projection. Video indexing using color histogram matching and image correlation has been attempted, though it is not clear the correlation could be done rapidly enough for most interactive applications. Hidden Markov model video segmentation using motion features has been studied, but it does not use image features directly, nor does it use image features for image similarity matching.
In addition to providing predefined classes for video retrieval and navigation, video classification techniques can be used for other purposes as well. When users see an interesting scene during video playback, such as a close-up of a speaker in a presentation, they may be interested in finding similar scenes even if no predefined image class exists for that particular situation. The present invention provides a method for interactively selecting a scene from a video and finding similar scenes in the video. The present invention includes a system that can rapidly find time intervals of video similar to that selected by a user. When displayed graphically, similarity results assist in determining the structure of a video or browsing to find a desired point. Because each video frame is represented as a small number of coefficients, similarity calculations are extremely rapid, on the order of thousands of times faster than real-time. This enables interactive applications according to the present invention.
Conventional systems lack the specificity, generality, or speed to interactively find similar video regions. Conventional color-based systems result in too many false positive similarity matches. Conventional systems based on pixel-domain approaches are either too computationally demanding, such as the image-domain correlation matching, or too specific in that video must be nearly identical to be judged similar. In contrast, according to the present invention, the reduced-transform features and statistical models are accurate, generalize well, and work rapidly.
The present invention is embodied in a system for interactively browsing, querying, and retrieving video by similarity. Interactively selected video regions are used to train a statistical model on-the-fly. Query training segments are either individual frames, segments of frames, non-contiguous segments, or collections of images. The system can also be used to retrieve similar images from one or more still images. Similarity measures are based on statistical likelihood of the reduced transform coefficients. The similarity is rapidly calculated, graphically displayed, and used as indexes for interactively locating similar video regions.
According to the present invention, search and segmentation are done simultaneously, so that prior segmentation of the video into shots is not required. Each frame of the video is transformed using a discrete cosine transform or Hadamard transform. The transformed data is reduced by discarding less important coefficients, thus yielding an efficient representation of the video. The query training segment or segments are used to train a Gaussian model. A simple search can then be performed by computing the probability of each video frame being produced by the trained Gaussian model. This provides a sequence of confidence scores indicating the degree of similarity to the query. Confidence scores are useful in a video browser, where similarity can be readily displayed.
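The transform, reduction, training, and scoring steps described above can be illustrated with a minimal sketch. It is not the claimed implementation. It assumes small grayscale frames given as nested lists, assumes that the "less important" coefficients are everything outside a low-frequency block of size keep x keep, and assumes a diagonal-covariance Gaussian with a variance floor. The function names and the synthetic frames are hypothetical.

```python
import math

def dct_1d(v):
    """Orthonormal DCT-II of a list of numbers."""
    n = len(v)
    out = []
    for k in range(n):
        s = sum(v[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def dct_2d(frame):
    """Separable 2-D DCT: transform every row, then every column."""
    rows = [dct_1d(r) for r in frame]
    cols = [dct_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

def reduced_features(frame, keep=2):
    """Assumption: keep only the keep x keep low-frequency coefficients."""
    coeffs = dct_2d(frame)
    return [coeffs[u][v] for u in range(keep) for v in range(keep)]

def train_gaussian(vectors, var_floor=1e-3):
    """Fit per-dimension mean and variance (diagonal covariance is an assumption)."""
    n = len(vectors)
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / n for d in range(dim)]
    var = [max(sum((v[d] - mean[d]) ** 2 for v in vectors) / n, var_floor)
           for d in range(dim)]
    return mean, var

def log_likelihood(x, model):
    """Log-probability of feature vector x under the trained Gaussian."""
    mean, var = model
    return sum(-0.5 * (math.log(2 * math.pi * s) + (xi - m) ** 2 / s)
               for xi, m, s in zip(x, mean, var))

def synthetic_frame(level, size=4):
    """Hypothetical test frame: a brightness level plus a checkerboard texture."""
    return [[level + ((r + c) % 2) for c in range(size)] for r in range(size)]

# Frames 0, 1 and 3 are dark; frames 2 and 4 are bright.
video = [synthetic_frame(10), synthetic_frame(12), synthetic_frame(200),
         synthetic_frame(11), synthetic_frame(205)]
features = [reduced_features(f) for f in video]

# The query training segment is frames 0..1; every frame is then scored.
model = train_gaussian(features[0:2])
scores = [log_likelihood(x, model) for x in features]
```

In this sketch the dark frame outside the training segment (frame 3) receives a far higher score than the bright frames, which is the frame-by-frame confidence sequence a browser would display.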
According to an embodiment of the present invention, reduced transform coefficients corresponding to each frame in the video are stored in a precomputed feature vector database. This feature vector database is accessed both for training statistical models after selection of a query training segment, and for evaluating similarity of each frame once the statistical model is trained.
The present invention includes methods for retrieving video segments by similarity. The user forms a query by selecting a video segment or segments. A statistical model of the query video segment is formed and is used to search the video for similar segments. The similarity score for each frame is computed based on image transform coefficients. Similar video segments in the video database are identified and presented to the user. Rather than returning a discrete set of similar video clips, the system provides a similarity score that can be used in a video browser to view more or less similar segments.
According to an aspect of the present invention, a time bar below the video window displays the likelihood of each frame and thus the degree of similarity to the query training segment. The darker the bar, the more similar the video is to the query training segment. This browser is also used to randomly access the similar segments by clicking on the similar sections of the time bar. The user may interactively define one or more training video segments by mouse click-and-drag operations over a section of the time bar.
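One possible mapping between per-frame scores and such a time bar is sketched below. It is illustrative only: the min-max normalization, the choice of the best score under each bar pixel, the 0 to 255 gray scale, and all function names are assumptions not taken from the description above.

```python
def time_bar_shades(scores, width):
    """Return one gray level per bar pixel: 0 is black (most similar), 255 is white."""
    lo, hi = min(scores), max(scores)
    span = hi - lo
    n = len(scores)
    shades = []
    for x in range(width):
        # Frames covered by this pixel column of the bar (at least one frame).
        start = x * n // width
        end = max(start + 1, (x + 1) * n // width)
        best = max(scores[start:end])
        norm = (best - lo) / span if span > 0 else 0.0
        shades.append(round(255 * (1.0 - norm)))
    return shades

def click_to_frame(x, width, num_frames):
    """Map a horizontal click position on the bar to a frame index."""
    return min(num_frames - 1, x * num_frames // width)

def drag_to_segment(x0, x1, width, num_frames):
    """Map a click-and-drag over the bar to a (first, last) training segment."""
    a, b = sorted((x0, x1))
    return click_to_frame(a, width, num_frames), click_to_frame(b, width, num_frames)

# Six frames of hypothetical log-likelihood scores drawn on a 3-pixel bar.
example_scores = [-50, -40, -5, -4, -45, -60]
shades = time_bar_shades(example_scores, width=3)
segment = drag_to_segment(2, 1, width=3, num_frames=6)
```

Here the middle pixel covers the two highest-scoring frames and is therefore drawn darkest, and a drag in either direction yields the same ordered segment.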
According to another aspect of the present invention, a web-based browser displays all frames at a periodic predetermined time interval, such as five seconds, in the video. The user selects the training video segment or segments by selecting adjacent periodic frames. All non-displayed intervening frames are then used as the training segment. For example, all frames in the five second interval between two selected adjacent periodic frames are used as a training segment. Once calculated, similarity is displayed as a shade around the displayed periodic frames.
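The expansion of selected periodic frames into a training segment might be sketched as follows. The frame rate of 30 frames per second, the inclusion of the two displayed endpoint frames in the training segment, and the function names are assumptions made for illustration.

```python
def keyframe_indices(num_frames, fps=30, interval_s=5):
    """Frame indices displayed by the browser, one every interval_s seconds."""
    step = fps * interval_s
    return list(range(0, num_frames, step))

def training_frames(selected, keyframes, num_frames):
    """Expand selected periodic frames into the full set of training frames.

    selected holds sorted positions in the keyframes list. Each pair of
    adjacent selected positions contributes every frame between the two
    displayed frames (endpoints included, by assumption).
    """
    frames = set()
    for a, b in zip(selected, selected[1:]):
        if b == a + 1:
            last = min(keyframes[b], num_frames - 1)
            frames.update(range(keyframes[a], last + 1))
    return sorted(frames)

# A 20-second video at the assumed frame rate; the user picks the
# displayed frames at 5 s and 10 s.
keys = keyframe_indices(600)
train = training_frames([1, 2], keys, 600)
```

With these assumptions the selection covers frames 150 through 300, and selected frames that are not adjacent contribute nothing on their own.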
According to another aspect of the present invention, an adjustable threshold slider bar is provided in the browser. Frames having similarity scores above the threshold are marked as similar. Video segmentation is performed from a frame-by-frame measure of similarity. A Gaussian model can be used for segmentation by finding when the model likelihood crosses a threshold. Contiguous similar frames define a similar segment. Similar segments are displayed in the browser, and skip forward and backward buttons may be used for browsing to the beginning of the next or previous similar segment. If the time bar is activated in this segmentation, dark sections of the time bar indicate similar segments, and white sections of the time bar indicate non-similar segments.
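The threshold-based segmentation and the skip buttons can be sketched as below. This is a simplified illustration, assuming the scores are per-frame log-likelihoods held in a list; the function names and the example values are hypothetical.

```python
def similar_segments(scores, threshold):
    """Group contiguous frames whose score exceeds the threshold.

    Returns a list of (first_frame, last_frame) pairs, in temporal order.
    """
    segments = []
    start = None
    for i, s in enumerate(scores):
        if s > threshold:
            if start is None:
                start = i          # likelihood crossed above the threshold
        elif start is not None:
            segments.append((start, i - 1))  # crossed back below
            start = None
    if start is not None:
        segments.append((start, len(scores) - 1))
    return segments

def skip_forward(segments, current):
    """Beginning of the next similar segment after the current frame."""
    for start, _ in segments:
        if start > current:
            return start
    return current  # no later segment: stay in place

def skip_backward(segments, current):
    """Beginning of the closest similar segment starting before the current frame."""
    for start, _ in reversed(segments):
        if start < current:
            return start
    return current  # no earlier segment: stay in place

# Hypothetical per-frame scores and a slider threshold of -4.
example = [-9, -2, -1, -8, -7, -3, -2, -1, -9]
segs = similar_segments(example, threshold=-4)
```

Moving the slider changes only the threshold argument, so the segments can be recomputed from the stored scores without rescoring the video.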