1. Field of the Invention
This invention relates to techniques for searching and retrieving visual information and, more particularly, to the use of content-based search queries to search for and retrieve moving visual information.
2. Description of Related Art
During the past several years, as the Internet has reached maturity and multimedia applications have come into widespread use, the stock of readily available digital video information has grown steadily. In order to reduce bandwidth requirements to manageable levels, such video information is generally stored in the digital environment in the form of compressed bitstreams that are in a standard format, e.g., JPEG, Motion JPEG, MPEG-1, MPEG-2, MPEG-4, H.261 or H.263. At the present time, hundreds of thousands of different still and motion images, representing everything from oceans and mountains to skiing and baseball, are available over the Internet.
With the increasing wealth of video information available in a digital format, a need to meaningfully organize and search through such information has become pressing. Specifically, users are increasingly demanding a content based video search engine that is able to search for and retrieve specific pieces of video information which meet arbitrary predetermined criteria, such as shape or motion characteristics of video objects embedded within the stored video information, in response to a user-defined query.
In response to this need, there have been several attempts to develop video search and retrieval applications. Existing techniques fall into two distinct categories: query by example ("QBE") and visual sketching.
In the context of image retrieval, examples of QBE systems include QBIC, PhotoBook, VisualSEEk, Virage and FourEyes, some of which are discussed in T. Minka, "An Image Database Browser that Learns from User Interaction," MIT Media Laboratory Perceptual Computing Section, TR #365 (1996). These systems work under the premise that several satisfactory matches must lie within the database. Under this premise, the search begins with an element in the database itself, with the user being guided towards the desired image over a succession of query examples. Unfortunately, such "guiding" leads to substantial wasted time as the user must continuously refine the search.
Although space partitioning schemes to precompute hierarchical groupings can speed up the database search, such groupings are static and require recomputation when a new video is inserted into the database. Likewise, although QBE is, in principle, extensible, video shots generally contain a large number of objects, each of which is described by a complex multi-dimensional feature vector. The complexity arises partly due to the problem of describing shape and motion characteristics.
The second category of search and retrieval systems, sketch-based query systems, locates visual information by comparing a user-drawn sketch against each of the images in the database. The system described in Hirata et al., "Query by Visual Example, Content Based Image Retrieval, Advances in Database Technology - EDBT," 580 Lecture Notes on Computer Science (1992, A. Pirotte et al. eds.), computes the correlation between the sketch and the edge map of each of the images in a database. In A. Del Bimbo et al., "Visual Image Retrieval by Elastic Matching of User Sketches," 19 IEEE Trans. on PAMI, 121-132 (1997), a technique which minimizes an energy functional to achieve a match is described. In C. E. Jacobs, et al., "Fast Multiresolution Image Querying," Proc. of SIGGRAPH, 277-286, Los Angeles (August 1995), the authors compute a distance between the wavelet signatures of the sketch and each of the images in the database.
Although some attempts have been made to index video shots, none attempt to represent video shots as a dynamic collection of video objects. Instead, the prior techniques have utilized image retrieval algorithms for indexing video simply by assuming that a video clip is a collection of image frames.
In particular, the techniques developed by Zhang and Smoliar as well as the ones developed at QBIC use image retrieval methods (such as color histograms) for video. A "key-frame" is chosen from each shot, e.g., the r-frame in the QBIC method. In the case of Zhang and Smoliar, the key frame is extracted from a video clip by choosing a single frame from the clip. The frame is chosen by averaging over all the frames in the shot and then choosing the frame in the clip which is closest to the average. The key frames are then used to index video by way of conventional image searches, such as a color histogram search.
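The averaging-based key-frame selection attributed above to Zhang and Smoliar can be illustrated with a minimal sketch. It assumes each frame has already been reduced to a fixed-length feature vector (for example, a color histogram) and that "closest" means smallest squared Euclidean distance; neither detail is stated in the source, and the function names are hypothetical.

```python
from typing import List, Sequence

# Assumption: a frame is represented by a flat feature vector,
# e.g. a color histogram. The source does not specify the representation.
Frame = Sequence[float]


def mean_frame(frames: List[Frame]) -> List[float]:
    """Component-wise average over all frames in the shot."""
    count = len(frames)
    width = len(frames[0])
    return [sum(frame[i] for frame in frames) / count for i in range(width)]


def squared_distance(a: Frame, b: Frame) -> float:
    """Squared Euclidean distance (an assumed closeness measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def select_key_frame(frames: List[Frame]) -> int:
    """Return the index of the frame closest to the shot average."""
    if not frames:
        raise ValueError("a shot must contain at least one frame")
    average = mean_frame(frames)
    return min(range(len(frames)),
               key=lambda i: squared_distance(frames[i], average))


# Toy shot of three two-bin "histograms".
shot = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
key_index = select_key_frame(shot)
```

Because the selected frame is a single still image, any motion within the shot is lost, which is the limitation the surrounding discussion points out.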
Likewise, in the QBIC project, the r-frame is selected by taking an arbitrary frame, such as the first frame, as the representative frame. If the video clip has motion, a mosaicked representation is used as the representative frame for the shot. QBIC then applies its image retrieval technology to these r-frames in order to index video clips.
In order to index video clips, the Informedia project creates a transcript of video by using a speech recognition algorithm on the audio stream. Recognized words are aligned with the video frame where the word was spoken. A user may search video clips by doing a keyword search. However, the speech-to-text conversion proved to be a major stumbling block: the accuracy of the conversion algorithm was low (around 20-30%), which significantly degraded the quality of retrieval.
The above-described prior art devices fail to satisfy the growing need for an effective content-based video search engine that is able to search for and retrieve specific pieces of video information which meet arbitrary predetermined criteria. The techniques are either incapable of searching motion video information or search such information only with respect to a global parameter such as panning or zooming. Likewise, the prior art fails to describe techniques for retrieving video information based on spatial and temporal characteristics. Thus, the aforementioned existing techniques cannot search for and retrieve specific pieces of video information which meet arbitrary predetermined criteria, such as shape or motion characteristics of video objects embedded within the stored video information, in response to a user-defined query.
An object of the present invention is to provide a truly content based video search engine.
A further object of the present invention is to provide a search engine which is able to search for and retrieve video objects embedded in video information.
Another object of the invention is to provide a mechanism for filtering identified video objects so that only objects which best match a user's search query will be retrieved.
Yet another object of the present invention is to provide a video search engine that is able to search for and retrieve specific pieces of video information which meet arbitrary predetermined criteria in response to a user-defined query.
A still further object of the present invention is to provide a search engine which is able to extract video objects from video information based on integrated feature characteristics of the video objects, including motion, color, and edge information.
In order to meet these and other objects which will become apparent with reference to further disclosure set forth below, the present invention provides a system for permitting a user to search for and retrieve video objects from one or more sequences of frames of video data over an interactive network. The system advantageously contains one or more server computers including storage for one or more databases of video object attributes and storage for one or more sequences of frames of video data to which the video object attributes correspond, a communications network permitting transmission of the one or more sequences of frames of video data from the server computers, and a client computer. The client computer houses a query interface to receive selected video object attribute information, including motion trajectory information; a browser interface receiving the selected video object attribute information and for browsing through stored video object attributes within the server computers by way of the communications network, to determine one or more video objects having attributes which match, within a predetermined threshold, the selected video object attributes; and also an interactive video player receiving one or more transmitted sequences of frames of video data from the server computers which correspond to the determined one or more video objects.
In a preferred arrangement, the databases stored on the server computers include a motion trajectory database, a spatio-temporal database, a shape database, a color database, and a texture database. The one or more sequences of frames of video data may be stored on the server computers in a compressed format such as MPEG-1 or MPEG-2.
The system also may include a mechanism for comparing each selected video object attribute to corresponding stored video object attributes within the server computers, in order to generate lists of candidate video sequences, one for each video object attribute. Likewise, a mechanism for determining, based on the candidate lists, one or more video objects having collective attributes which match, within a predetermined threshold, the selected video object attributes is beneficially provided. The system also includes a mechanism for matching the spatial and temporal relations amongst multiple objects in the query to a group of video objects present in the video clip.
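The per-attribute candidate lists and their combination described above might be sketched as follows. This is an illustration under assumptions, not the disclosed mechanism: the in-memory dictionaries stand in for the attribute databases, Euclidean distance is assumed for every attribute, and the lists are combined by intersection followed by ranking on summed distance.

```python
import math
from typing import Dict, List

Vector = List[float]


def distance(a: Vector, b: Vector) -> float:
    """Euclidean distance (assumed; the source does not fix a metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def candidate_list(query: Vector, stored: Dict[str, Vector],
                   threshold: float) -> Dict[str, float]:
    """Objects whose stored attribute lies within the threshold of the query."""
    result = {}
    for object_id, vector in stored.items():
        d = distance(query, vector)
        if d <= threshold:
            result[object_id] = d
    return result


def merge_candidates(lists: Dict[str, Dict[str, float]]) -> List[str]:
    """Keep objects present in every list, ranked by total distance."""
    common = set.intersection(*(set(c) for c in lists.values()))
    return sorted(common, key=lambda oid: sum(c[oid] for c in lists.values()))


# Hypothetical attribute databases: attribute -> {object id -> feature vector}.
databases = {
    "color": {"obj1": [1.0, 0.0], "obj2": [0.9, 0.1], "obj3": [0.0, 1.0]},
    "shape": {"obj1": [0.5], "obj2": [0.4], "obj3": [0.5]},
}
query = {"color": [1.0, 0.0], "shape": [0.5]}

lists = {name: candidate_list(query[name], databases[name], threshold=0.3)
         for name in query}
matches = merge_candidates(lists)
```

Here `obj3` passes the shape test but fails the color test, so it is dropped when the lists are merged.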
In accordance with a second aspect of the present invention, a method for extracting video objects from a sequence of frames of video data which include at least one recognizable attribute is provided. The method calls for quantizing a present frame of video data by determining and assigning values to different variations of at least one attribute represented by the video data to generate quantized frame information; performing edge detection on the frame of video data based on the attribute to determine edge points in the frame to thereby generate edge information; receiving one or more segmented regions of video information from a previous frame, and extracting regions of video information sharing the attribute by comparing the received segmented regions to the quantized frame information and the generated edge information.
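The quantizing and edge detection steps named above can be illustrated with a deliberately simple sketch. It assumes the attribute is a single-channel intensity in the range 0-255, uses uniform quantization, and marks edge points with a forward-difference gradient test; the source does not specify any of these choices, so they should be read as placeholders.

```python
from typing import List, Set, Tuple

Image = List[List[int]]  # rows of intensity values (assumed 0-255)


def quantize(frame: Image, levels: int = 4, max_value: int = 255) -> Image:
    """Assign each pixel to one of `levels` uniform intensity bins."""
    step = (max_value + 1) / levels
    return [[int(pixel // step) for pixel in row] for row in frame]


def edge_points(frame: Image, threshold: int) -> Set[Tuple[int, int]]:
    """Return (x, y) positions whose forward-difference gradient is large."""
    height, width = len(frame), len(frame[0])
    edges = set()
    for y in range(height):
        for x in range(width):
            # Differences to the right and lower neighbours (clamped at borders).
            gx = frame[y][min(x + 1, width - 1)] - frame[y][x]
            gy = frame[min(y + 1, height - 1)][x] - frame[y][x]
            if abs(gx) + abs(gy) >= threshold:
                edges.add((x, y))
    return edges


# Toy frame: a dark region on the left, a bright region on the right.
frame = [[10, 10, 200, 200],
         [10, 10, 200, 200]]
quantized = quantize(frame)
edges = edge_points(frame, threshold=100)
```

The quantized frame and the edge set are the two inputs against which segmented regions from the previous frame would then be compared.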
Preferably, the extracting step consists of performing interframe projection to extract regions in the current frame of video data by projecting one of the received regions onto the current quantized, edge detected frame to temporally track any movement of the region; and performing intraframe segmentation to merge neighboring extracted regions in the current frame under certain conditions. The extracting step may also include labeling all edges in the current frame which remain after intraframe segmentation to neighboring regions, so that each labeled edge defines a boundary of a video object in the current frame.
In a particularly preferred technique, a future frame of video information is also received, the optical flow of the present frame of video information is determined by performing hierarchical block matching between blocks of video information in the current frame and blocks of video information in the future frame; and motion estimation on the extracted regions of video information is performed, by way of determining an affine matrix, based on the optical flow. Extracted regions of video information may be grouped based on size and temporal duration, as well as on affine models of each region.
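To make the step of determining an affine matrix from the optical flow more concrete, the following sketch fits a 2x3 affine model to the flow vectors of one region by least squares. The least-squares formulation, the function names, and the small Gauss-Jordan solver are assumptions for illustration; the block matching that would produce the flow is not shown.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def solve3(matrix: List[List[float]], rhs: List[float]) -> List[float]:
    """Solve a 3x3 linear system by Gauss-Jordan elimination with pivoting."""
    a = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("points are degenerate (e.g. collinear)")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(3):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return [a[i][3] / a[i][i] for i in range(3)]


def fit_affine(points: List[Point], flows: List[Point]) -> List[List[float]]:
    """Least-squares affine matrix mapping (x, y) to (x + u, y + v)."""
    ptp = [[0.0] * 3 for _ in range(3)]  # normal-equation matrix P^T P
    ptx = [0.0] * 3                      # P^T times target x coordinates
    pty = [0.0] * 3                      # P^T times target y coordinates
    for (x, y), (u, v) in zip(points, flows):
        p = (x, y, 1.0)
        for i in range(3):
            for j in range(3):
                ptp[i][j] += p[i] * p[j]
            ptx[i] += p[i] * (x + u)
            pty[i] += p[i] * (y + v)
    return [solve3(ptp, ptx), solve3(ptp, pty)]


# A region whose every point moves by (+2, -1): a pure translation.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
flows = [(2.0, -1.0)] * 4
affine = fit_affine(points, flows)
```

For this pure translation the fitted matrix is, up to rounding, the identity in its first two columns with the translation in the third. Regions with similar fitted matrices could then be grouped, in line with the grouping on affine models mentioned above.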
In yet another aspect of the present invention, a method for locating a video clip which best matches a user-inputted search query from a sequence of frames of video data that include one or more video clips, where the video clip includes a video object temporally moving in a predetermined trajectory, is provided. The method advantageously includes receiving a search query defining at least one video object trajectory; determining the total distance between the received query and at least a portion of one or more pre-defined video object trajectories; and choosing one or more of said defined video object trajectories which have the least total distance from the received query to locate the best matched video clip or clips.
Both the search query and pre-defined video object trajectories may be normalized. The query normalizing step preferably entails mapping the received query to each normalized video clip, and scaling the received mapped query to each video object trajectory defined by the normalized video clips. The determining step is realized either by a spatial distance comparison, or a spatio-temporal distance comparison.
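The difference between the spatial and the spatio-temporal comparison can be shown with a small sketch. It assumes trajectories are lists of already-normalized (x, y) points, defines the spatial distance as the sum of nearest-point distances (ignoring order), and defines the spatio-temporal distance as the sum of point-wise distances after resampling to equal length. These definitions and all names are illustrative assumptions rather than the formulas of the invention.

```python
import math
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float]
Trajectory = List[Point]


def resample(traj: Trajectory, n: int) -> Trajectory:
    """Linearly resample a trajectory to n points, preserving order."""
    if len(traj) == 1 or n == 1:
        return [traj[0]] * n
    out = []
    for k in range(n):
        t = k * (len(traj) - 1) / (n - 1)
        i = min(int(t), len(traj) - 2)
        frac = t - i
        (x0, y0), (x1, y1) = traj[i], traj[i + 1]
        out.append((x0 + frac * (x1 - x0), y0 + frac * (y1 - y0)))
    return out


def spatial_distance(query: Trajectory, traj: Trajectory) -> float:
    """Sum of distances from each query point to its nearest stored point."""
    return sum(min(math.hypot(qx - px, qy - py) for px, py in traj)
               for qx, qy in query)


def spatio_temporal_distance(query: Trajectory, traj: Trajectory) -> float:
    """Sum of distances between time-aligned points."""
    aligned = resample(traj, len(query))
    return sum(math.hypot(qx - px, qy - py)
               for (qx, qy), (px, py) in zip(query, aligned))


def best_match(query: Trajectory, stored: Dict[str, Trajectory],
               metric: Callable[[Trajectory, Trajectory], float]) -> str:
    """Clip whose trajectory has the least total distance from the query."""
    return min(stored, key=lambda clip: metric(query, stored[clip]))


stored = {
    "rising": [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)],
    "falling": [(0.0, 2.0), (1.0, 1.0), (2.0, 0.0)],
}
query = [(0.0, 0.0), (2.0, 2.0)]
reversed_path = [(2.0, 2.0), (1.0, 1.0), (0.0, 0.0)]
```

With these definitions, `reversed_path` is indistinguishable from the query under the spatial comparison but far from it under the spatio-temporal one, since only the latter takes the direction of travel into account.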
In still another aspect of the present invention, a method for locating a video clip which best matches a user-inputted search query from one or more video clips, where each video clip comprises one or more video objects each having predetermined characteristics, is provided. This method includes receiving a search query defining one or more characteristics for one or more different video objects in a video clip; searching the video clips to locate video objects which match, to a predetermined threshold, at least one of said defined characteristics; determining, from the located video objects, the video clips which contain the one or more different video objects; and determining a best matched video clip from the determined video clips by calculating the distance between the one or more video objects defined by the search query, and the located video objects. The characteristics may include color, texture, motion, size or shape.
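The multi-object search outlined above might be sketched as follows: each query object is matched against the objects of every clip, clips lacking a match for any query object are discarded, and the remaining clips are ranked by total distance. The single feature vector per object, the Euclidean metric, and the names are assumptions. As a simplification, the sketch does not prevent two query objects from matching the same stored object.

```python
import math
from typing import Dict, List, Optional

Vector = List[float]


def distance(a: Vector, b: Vector) -> float:
    """Euclidean distance between feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def best_clip(query_objects: List[Vector], clips: Dict[str, List[Vector]],
              threshold: float) -> Optional[str]:
    """Return the clip that contains all query objects at least total distance."""
    scores = {}
    for clip_id, objects in clips.items():
        total = 0.0
        for q in query_objects:
            nearest = min((distance(q, o) for o in objects), default=math.inf)
            if nearest > threshold:
                break  # this clip lacks a match for one query object
            total += nearest
        else:
            scores[clip_id] = total  # every query object was matched
    if not scores:
        return None
    return min(scores, key=lambda clip_id: scores[clip_id])


# Hypothetical clips, each a list of per-object feature vectors.
clips = {
    "skiing": [[1.0, 0.0], [0.0, 1.0]],
    "baseball": [[1.0, 0.1]],
    "ocean": [[0.9, 0.0], [0.1, 0.9]],
}
query_objects = [[1.0, 0.0], [0.0, 1.0]]
result = best_clip(query_objects, clips, threshold=0.5)
```

In this toy data the "baseball" clip is rejected because it has no object near the second query object, and "skiing" wins over "ocean" on total distance.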
In a highly preferred arrangement, the video clips include associated text information and the search query further includes a definition of text characteristics corresponding to the one or more different video objects, and the method further includes the step of searching the associated text information to locate text which matches the text characteristics. Then, the best matched video clip is determined from the determined video clips and the located text.