The present invention relates generally to video signal processing. More particularly, the invention relates to a video indexing and image retrieval system.
Over the last few years, there has been a dramatic increase in available bandwidth over high-speed networks. At the same time, computer manufacturers have improved the storage capacities of hard drives on personal computers, as well as the speed of the system bus and the motherboards that access the hard drives. Improvements in the quality and efficiency of data compression algorithms have likewise improved information transmission efficiency and access rates, particularly with respect to video data.
One of the most important tasks of a database manager is to provide easy and intuitive access to data. This task can be particularly difficult when a user would like to search for images or other visual data such as video segments. By contrast, text-based browsing and searching on the Internet allows users to access related Internet pages rapidly and intuitively.
To allow the searching of visual data, an image retrieval system must be able to assess the similarity of a query image with images or frames of video stored in a database. There are several ways that a user may provide a query image. For example, users may have a rough idea of an image that they are looking for. The user may develop a simple sketch of an image by hand, and the sketch can be uploaded with a scanner or created with drawing software. Alternatively, a photo of the image or of a similar image can be used to find other similar images in the database.
An image search engine must be able to generate a measurement of the similarity between the query image and the database images so that the user is presented with a list of database images ordered from most relevant to least relevant. The image search engine associated with the image retrieval system must be able to look for similarities between significant features of the query sketch or image and the database images while ignoring minor detail variations. In other words, the image search engine must measure the visual similarity between the query image and the database images in a manner that is invariant to such minor variations.
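The ranking behavior described above can be illustrated with a minimal sketch. It assumes each image has already been reduced to an equal-length feature vector, and it uses a plain sum of absolute differences as a placeholder dissimilarity measure. The function names, the feature vectors, and the placeholder measure are all hypothetical and are not the S-distance measurement of the invention.

```python
def l1_distance(a, b):
    """Placeholder dissimilarity: sum of absolute differences (not the S-distance)."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have equal length")
    return sum(abs(x - y) for x, y in zip(a, b))


def rank_images(query, database, distance=l1_distance):
    """Return (name, distance) pairs ordered from most to least similar.

    `database` maps a hypothetical image name to its feature vector.
    A smaller distance means a more relevant image.
    """
    scored = [(name, distance(query, features)) for name, features in database.items()]
    return sorted(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    db = {"a": [0, 0, 0], "b": [1, 1, 1], "c": [5, 5, 5]}
    for name, dist in rank_images([1, 1, 0], db):
        print(name, dist)
```

Passing the distance as a parameter keeps the ranking step independent of whichever similarity measure is actually used.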
When searching video sequences, it would be inefficient for the image retrieval system to compare the query image to every frame of the video sequence. A video sequence typically contains one or more shots. A shot is a sequence of related frames that are taken by one camera without interruption. To avoid this inefficiency, the image database manager must take the time to segment the video sequence into shots and identify a key frame to represent each shot. To simplify this problem, it is desirable to perform video segmentation and key frame identification automatically.
A first step towards automatic video indexing is the ability to identify both abrupt transitions and gradual transitions. An abrupt transition is a discontinuous transition between two images and is also referred to as a cut transition. Gradual transitions include fade, dissolve, and wipe transitions. When an image gradually disappears into black or white or gradually appears from black or white, a fade transition occurs. When an image gradually disappears at the same time that another image gradually appears, a dissolve transition occurs. When a first image gradually blocks a second image, a wipe transition occurs. Gradual transitions are composite shots that are created from more than one shot.
A shot transition detector must be sensitive to both abrupt transitions and gradual transitions to be successful in an automatic video indexing system. The shot transition detector should also be insensitive to other changes. In other words, the shot transition detector should ignore small detail changes, image motion and camera motion. For example, panning, zooming and tilting should not significantly impact the query results.
Conventional video retrieval systems have employed several different types of cut transition detection techniques including histogram difference, frame difference, motion vector analysis, compression difference, and neural-network approaches. Frame differencing detection systems are extremely sensitive to local motion. Histogram detection systems successfully identify abrupt shot transitions and fades, but work poorly on gradual transitions such as wipe and dissolve. Motion vector detection systems require extensive computations that are prohibitive when large image databases are used. Neural-network detection systems do not provide improved performance over the other cut detection systems and require significant computations for the neural-network training process.
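The contrast between frame differencing and histogram differencing can be shown with a small sketch. It assumes grayscale frames stored as equally sized nested lists of integer pixel values from 0 to 255. The function names and the bin count are illustrative assumptions rather than details of any particular conventional system.

```python
def frame_difference(f1, f2):
    """Mean absolute pixel-wise difference between two equally sized frames."""
    total = 0
    count = 0
    for row1, row2 in zip(f1, f2):
        for p1, p2 in zip(row1, row2):
            total += abs(p1 - p2)
            count += 1
    return total / count


def histogram(frame, bins=4, levels=256):
    """Count pixels per intensity bin; pixel positions are discarded."""
    counts = [0] * bins
    for row in frame:
        for p in row:
            counts[p * bins // levels] += 1
    return counts


def histogram_difference(f1, f2, bins=4):
    """Sum of absolute bin-count differences between two frames."""
    h1, h2 = histogram(f1, bins), histogram(f2, bins)
    return sum(abs(a - b) for a, b in zip(h1, h2))


if __name__ == "__main__":
    # The same content shifted by one pixel, as local motion might produce.
    before = [[0, 255], [0, 255]]
    after = [[255, 0], [255, 0]]
    print(frame_difference(before, after))
    print(histogram_difference(before, after))
```

In this toy case every pixel changes, so the frame difference is large, while the histogram difference is zero because the intensity distribution is unchanged. The same property explains why a histogram measure can miss transitions that rearrange content without altering the overall distribution.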
Additional algorithms have been proposed that address gradual transitions such as fade, dissolve, and wipe transitions. Edge tracking systems measure the relative values of entering and exiting edge percentages. Edge tracking systems are able to correctly identify less than 20% of gradual transitions. Edge tracking systems also require a motion estimation step to align consecutive frames, which is computationally expensive, and their performance is highly dependent upon the accuracy of that step. Chromatic scaling systems assume that fade-in transitions and fade-out transitions are to and from black only. Chromatic scaling systems also assume that both object and camera motion are low immediately before and after the transition period.
A video segmentation system according to the invention includes a video source that provides a video sequence with a plurality of frames. The video segmentation system generates an S-distance measurement between adjacent frames of the video sequence. The S-distance measurement gauges the similarity between the adjacent frames.
A frequency decomposer that preferably employs wavelet decomposition generates a low frequency signature and a high frequency signature for each frame. A cut detector identifies cut transitions between two adjacent frames using the low frequency signature. The cut detector generates a difference signal between coefficients of the low frequency signature for adjacent frames and compares the difference signal to a threshold. If the difference signal exceeds the threshold, a cut transition is declared.
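The decomposition and thresholding steps described above can be sketched as follows. The sketch assumes a single-level Haar-style split over 2x2 pixel blocks, grayscale frames stored as nested lists with even width and height, and a sum of absolute coefficient differences as the difference signal. The choice of Haar filter, the function names, and the threshold value are assumptions made for illustration and are not specified by the text.

```python
def haar_signatures(frame):
    """One-level 2D Haar-style split of a frame with even width and height.

    Returns (low, high): `low` holds the 2x2 block averages and `high`
    holds three detail coefficients per block.
    """
    low, high = [], []
    for r in range(0, len(frame), 2):
        for c in range(0, len(frame[0]), 2):
            a, b = frame[r][c], frame[r][c + 1]
            d, e = frame[r + 1][c], frame[r + 1][c + 1]
            low.append((a + b + d + e) / 4)   # block average
            high.append((a - b + d - e) / 4)  # left column minus right column
            high.append((a + b - d - e) / 4)  # top row minus bottom row
            high.append((a - b - d + e) / 4)  # diagonal difference
    return low, high


def detect_cuts(frames, threshold):
    """Return each index i where a cut is declared between frames i and i + 1."""
    lows = [haar_signatures(f)[0] for f in frames]
    cuts = []
    for i in range(len(lows) - 1):
        # Difference signal between low frequency coefficients of adjacent frames.
        diff = sum(abs(x - y) for x, y in zip(lows[i], lows[i + 1]))
        if diff > threshold:
            cuts.append(i)
    return cuts


if __name__ == "__main__":
    dark = [[10, 10], [10, 10]]
    dark_noisy = [[12, 10], [10, 10]]
    bright = [[200, 200], [200, 200]]
    print(detect_cuts([dark, dark_noisy, bright, bright], threshold=50))
```

Because the low frequency coefficients are block averages, a small single-pixel change barely moves the difference signal, while a change of scene moves it well past the threshold.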
After identifying the cut transitions, the video segmentation system according to the invention employs a fade detector that identifies fade transitions using the high frequency signatures for frames located between the cut transitions. The fade detector includes a summing signal generator that sums the coefficients of the high frequency signature for each frame and compares the sum signal to a linear signal which is an increasing function for fade in and a decreasing function for fade out. A dissolve transition detector employs the high frequency signature to identify potential dissolve transitions. A double frame difference generator confirms the dissolve transitions. As can be appreciated, the video segmentation system according to the invention dramatically improves the identification of abrupt and gradual transitions. The video segmentation system achieves segmentation in a computationally efficient manner.
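The comparison of the sum signal to a linear signal can be sketched as a straight-line fit. The sketch assumes the input is one value per frame, namely the sum of the magnitudes of that frame's high frequency coefficients, taken from frames lying between two cut transitions. The least-squares fit, the use of magnitudes, the `tolerance` parameter, and the function name are assumptions made for illustration and do not come from the text.

```python
def classify_fade(sums, tolerance=0.1):
    """Classify a run of per-frame sums as 'fade_in', 'fade_out', or None.

    Fits a least-squares line to the sums and accepts the run as a fade
    when the worst deviation from that line is at most `tolerance` times
    the range of the sums.
    """
    n = len(sums)
    if n < 3:
        return None  # too few frames to judge linearity
    mean_x = (n - 1) / 2
    mean_y = sum(sums) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(sums))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    span = max(sums) - min(sums)
    if span == 0:
        return None  # constant detail level, so no fade
    worst = max(abs(y - (slope * x + intercept)) for x, y in enumerate(sums))
    if worst > tolerance * span:
        return None  # not close enough to a straight line
    return "fade_in" if slope > 0 else "fade_out"


if __name__ == "__main__":
    print(classify_fade([0, 10, 20, 30, 40]))
    print(classify_fade([40, 30, 20, 10, 0]))
    print(classify_fade([0, 40, 0, 40, 0]))
```

An increasing line corresponds to detail emerging from a uniform frame (fade in), and a decreasing line corresponds to detail vanishing into one (fade out). A run that oscillates or stays flat is rejected.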
An image retrieval system according to the invention also employs an S-distance measurement to compare a query image to images located in a database. The S-distance measurement is used to allow a user to search and browse in a manner similar to text-based systems provided by the Internet.