1. Field of the Invention
The present invention is directed to the field of multimedia data retrieval. It is particularly directed toward a method and system which enable a user to query a multimedia archive in one media modality and automatically retrieve correlating data in another media modality, without the need for manually associating the data items through a data structure.
2. Description of the Related Art
Over the past decade, the number of multimedia applications has grown exponentially and the volume of multimedia content has continued to soar. Enhanced computing power, the growth of the World Wide Web, and the availability of more compact and inexpensive storage media have fueled this growth. These developments have, in turn, increased interest in content-based retrieval of multimedia.
However, existing approaches to retrieving multimedia content are limited. For example, in order to query a multimedia database to retrieve an image, the query must take the form of an image. It is not possible to retrieve a picture of a waterfall using the sound of a waterfall as the query. Retrieval continues to be limited to a single multimedia domain, except for rudimentary cross-media retrieval by keyword.
U.S. patent application Ser. No. 10/076,194 describes a system and method for associating facial images with speech, without the need for face recognition. An object detection module provides a plurality of object features from the video face data and an audio segmentation module provides a plurality of audio speech features related to the video. The latent semantic indexing (LSI) technique is used to correlate the object features and to locate the face that is doing the speaking in the video. This application does not describe data retrieval and deals only with audio and video modalities.
U.S. Pat. No. 6,154,754 to Hse et al., entitled Automatic Synthesis of Semantic Information From Multimedia Documents, discloses a system for building hierarchical information structures for non-textual media. The pieces of information that are extracted from textual and nontextual media are termed AIUs (Anchorable Information Units) and are both represented in Standard Generalized Markup Language (SGML), so they can be processed in the same manner. An AIU object is a sequence of one or more parsable character strings or ASCII strings. The '754 patent is directed at linking textual and non-textual media documents, based upon a textual conversion, and does not address retrieval of video segments, for example.
European Patent Application No. EP 1 120 720 A2 to Ball et al., entitled User Interface for Data Presentation Systems, discloses a method for enhancing user interfaces. The user may present a query in a natural language format, as text, speech or point and click, and the method translates the query to a standard database query for retrieving text. If the natural language query cannot be effectively converted, the method supplies the user with additional information and continues to prompt the user for a query. This application does not address cross-modality retrieval of information.
International Patent Publication Number WO 00/45307 A1, entitled Multimedia Archive Description Scheme, discloses a description scheme for a collection of multimedia records. The scheme relates records using a data structure called a cluster. The cluster is formed by evaluating the attributes of the record descriptions for similarity. Clusters can be grouped to form other clusters. Examples of clusters are Art, History, Expressionist, and Impressionist. Cluster information must be stored for each record, and this limits the types of query that can retrieve a particular record.
United States Patent Application Publication No. US 2001/0028731 A1, entitled Canonical Correlation Analysis of Image/Control-Point Location Coupling for the Automatic Location of Control Points, discloses a method for deriving hidden data (control points) based upon observable data. Groups of control points are used to locate a feature of interest, such as a mouth, and could be located at the corners of the mouth, at the inner and outer edges of the lips, and at the centers thereof. The publication discloses how to generate a model to locate these control points on unmarked images. The system is a single media modality system and does not retrieve data.
U.S. Pat. No. 6,343,298 B1 to Savchenko et al., entitled Seamless Multimedia Branching, discloses a method of authoring multimedia titles and storing multimedia content that implements seamless branching on digital media with high seek latency and a fixed upper bound on this latency. Continuous media content is arranged as individual clips on a storage medium and seamless branches between the clips are identified by an author. Individual clips are identified as carrier clips or non-carrier clips to guarantee seamlessness and to optimize memory usage and the availability of seamless jumps. Bridge data of a particular target media clip is interleaved or otherwise associated on the storage medium with a carrier clip that is upstream of the target media clip, and delivered along with the upstream media clip. This is not an automatic system and does not employ a statistical methodology.
Thus, there exists a need in the art for a cross-modality system which can automatically retrieve a media object in one modality that is related to a media object in a second modality without storing an association between the objects. What is needed is a means for seamlessly browsing heterogeneous multimedia content along with the ability to integrate different media sources based upon their semantic association.