Most physical and psychological concepts are associated with multiple attributes that are often in different domains. Coffee, for instance, is associated with a multitude of physical attributes, each involving a different set of our senses, such as color, smell, taste and temperature, as well as psychological attributes, such as joy. The human brain learns and creates the concept of “coffee” by correlating and associating all these attributes together. The more of these attributes that are present, the stronger the sense of “coffee” will be in our brain. Also, the brain allows us to tunnel between different domain perceptions associated with one concept; the smell of coffee, for instance, may create the perception of “joy,” before we even drink the cup. Such correlations cannot be explained unless one has knowledge of the concept of “coffee” through which this distinctive smell is linked to the joyous feeling.
The same idea applies to the domain of multimedia signals, where objects and concepts are usually associated with multiple attributes in the text, audio and image domains. The word “laugh” is associated with several representations: a smiley face, white teeth, the sound of laughter and the concept of “happiness.” A search for the keyword “Laughing” on Yahoo! Images returns the images in FIG. 18. But “laughter” and “laughing” are associated with a host of other concepts. A search for “Happy” returns the images in FIG. 21. Clearly, there is a strong correlation between the visual contents of the two sets of images, as well as between the keywords describing them. However, the images and the text lie in two different signal domains.
In the above example, there are two domains (the text and image domains) and three types of relationships: (1) between two attributes in the text domain, (2) between two attributes in the image domain, and (3) cross-domain relationships between attributes in the image and text domains. We know that the two terms “Laughing” and “Happiness” are conceptually related. This relationship can be discovered using, say, a lexical database such as WordNet, a tool specific to the text domain. The image domain relationships can be discovered using an image correlation method. Thus, the intra-domain correlations can be discovered using domain-specific analysis tools. The inter-domain relationships, however, have to be learned from examples, because attributes in different domains cannot be compared directly. After all, one cannot compare apples and oranges. Also, new intra-domain relationships may emerge based on inter-domain relationships. For example, the intra-domain relationship between an image of a birdhouse and an image of a Blue Jay is established via text-to-image cross-domain relationships (the blue dotted line in FIG. 23). However, to be able to learn the myriad of such cross-domain relationships that exist across multimedia signals, one needs a very large set of examples.
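By way of illustration only, the emergence of an image-to-image relationship from text-to-image relationships may be sketched as follows. This is a minimal sketch and not the method of this disclosure: the function name `image_links_via_text`, the file names and the tags are all hypothetical, and two images are simply treated as related whenever they share at least one text term.

```python
from collections import defaultdict
from itertools import combinations


def image_links_via_text(image_tags):
    """Derive image-to-image links from text-to-image links.

    image_tags maps an image identifier to the text terms attached to it
    (hypothetical data; any annotation source could supply it).
    Returns a dict mapping an (image_a, image_b) pair to the shared terms.
    """
    # Invert the mapping: text term -> set of images carrying that term.
    images_by_term = defaultdict(set)
    for image, terms in image_tags.items():
        for term in terms:
            images_by_term[term.lower()].add(image)

    # Any two images that share a term become related through that term.
    links = defaultdict(set)
    for term, images in images_by_term.items():
        for a, b in combinations(sorted(images), 2):
            links[(a, b)].add(term)
    return dict(links)


# Hypothetical annotations, for illustration only.
example_tags = {
    "birdhouse.jpg": ["birdhouse", "bird"],
    "blue_jay.jpg": ["Blue Jay", "bird"],
    "spoon.jpg": ["spoon"],
}
example_links = image_links_via_text(example_tags)
```

In this sketch the birdhouse image and the Blue Jay image become linked only because both carry the text term “bird”; no image-domain comparison is performed.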
Further, a related and longstanding goal in artificial intelligence (AI) is to enable content-based, automated querying of multimedia signals, such as object recognition in images and video, or speaker-independent speech recognition. Once again, a major obstacle in attaining this goal is the lack of a sufficient number of training examples to train AI classifiers. For certain classes of tasks, such datasets of examples have been collected manually. Examples include databases for face detection, pedestrian detection, and the like. This method, however, does not scale to the “Internet scale.” The state-of-the-art classifiers require thousands of positive examples that need to be carefully segmented. Manual collection of thousands of training images for each of the nearly 10,000 common objects is prohibitive. The same limitations apply to speaker-independent speech recognition, where one requires examples of the pronunciations of each word in the dictionary by hundreds of speakers.
The required training data and cross-domain examples, however, are available in raw form on the Web or in other unstructured datasets, such as movie archives. The Web now contains millions of freely available audio and video clips and images. These abundant examples, however, are at best loosely annotated by textual descriptions. These loose annotations have been used to enable multimedia searches on the Web that work to some extent (e.g., the above-mentioned Yahoo! Images search engine). For instance, to locate an image corresponding to an object X, those images that are annotated with the metadata X are returned. Examples include those images whose URLs contain the term X (e.g., X.jpg) or whose captions contain the term X. In the above example of the “laugh” concept, Yahoo! Images provides us with the required examples. FIG. 22 shows the collection of images downloaded from Yahoo! Images, Set 1 corresponding to “Happy” and Set 2 corresponding to “Laughing,” with the left side of the Figure showing the text phrases and the right side showing the images. Internet users have tagged the images in Set 1 with the term “Happy,” while the ones in Set 2 have been tagged with “Laughing.” This tagging is usually implicit; for instance, the name of the image file may be happ_kid.jpg, or the text most probably describing the image may contain the phrase “happy.”
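By way of illustration only, retrieval by loose annotation as described above may be sketched as follows. This is a minimal sketch and does not represent the implementation of any actual search engine: the function names `matches_term` and `search_by_annotation`, the record layout with `url` and `caption` keys, and the example URLs are all hypothetical. An image is returned when the term X appears in the file name portion of its URL or in its caption.

```python
from urllib.parse import urlparse


def matches_term(term, url, caption=""):
    """Return True if the term occurs in the URL's file name or in the caption."""
    needle = term.lower()
    # Keep only the last path component, e.g. "spoon.jpg".
    file_name = urlparse(url).path.rsplit("/", 1)[-1].lower()
    return needle in file_name or needle in caption.lower()


def search_by_annotation(term, records):
    """Return the records whose loose annotations mention the term.

    Each record is a dict with a "url" key and an optional "caption" key
    (an assumed layout, chosen only for this illustration).
    """
    return [
        record
        for record in records
        if matches_term(term, record["url"], record.get("caption", ""))
    ]


# Hypothetical records, for illustration only.
example_records = [
    {"url": "http://example.com/img/spoon.jpg"},
    {"url": "http://example.com/img/0012.jpg", "caption": "A silver spoon on a table"},
    {"url": "http://example.com/img/fork.jpg", "caption": "A fork"},
]
example_hits = search_by_annotation("Spoon", example_records)
```

Such matching relies on the annotation alone and never inspects the image content, which is why the returned set is only loosely related to the queried object.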
Similarly, the first 12 results from Yahoo! Images when searching for the term “Spoon” are shown in FIG. 29. Note that all these images contain the term “spoon” in the name of the file. Clearly, a good fraction of these images indeed contain the object “spoon.” However, a good fraction of the images still do not contain any image of an actual spoon. Another fraction of the images contain a “spoon” at an unknown location, along with other objects. Even though no individual image in this collection can be trusted to contain a spoon, the likelihood of finding a spoon in this collection is significantly larger than in a random collection of images. This disproportionate presence can be detected by an appropriate method to establish what constitutes the image of a “spoon” without the need for manual intervention.
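By way of illustration only, one simple way to quantify such a disproportionate presence is sketched below. This is a minimal sketch under stated assumptions and not the method of this disclosure: it assumes that each image has already been reduced to a set of visual feature labels by some image-domain tool, and the function name `enrichment_scores`, the feature labels and the additive smoothing are hypothetical choices. A feature receives a score above 1 when it occurs in a larger fraction of the tagged images than of the background images.

```python
from collections import Counter


def enrichment_scores(tagged_images, background_images, smoothing=1.0):
    """Score each feature by how over-represented it is in the tagged set.

    tagged_images and background_images are lists of sets of feature labels,
    one set per image (assumed to come from some image-domain analysis tool).
    The score is the ratio of smoothed occurrence frequencies.
    """
    tagged_counts = Counter(f for image in tagged_images for f in set(image))
    background_counts = Counter(f for image in background_images for f in set(image))

    scores = {}
    for feature, count in tagged_counts.items():
        # Additive smoothing avoids division by zero for unseen features.
        p_tagged = (count + smoothing) / (len(tagged_images) + 2 * smoothing)
        p_background = (background_counts[feature] + smoothing) / (
            len(background_images) + 2 * smoothing
        )
        scores[feature] = p_tagged / p_background
    return scores


# Hypothetical feature sets, for illustration only.
tagged = [{"oval", "handle"}, {"oval", "handle", "table"}, {"table"}, {"oval", "handle"}]
background = [{"table"}, {"sky"}, {"table", "sky"}, {"handle"}]
example_scores = enrichment_scores(tagged, background)
```

In this toy data the “oval” feature scores highest because it appears in most of the tagged images and in none of the background images, whereas “table” is equally common in both sets and therefore scores 1.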
Thus, the abundance of loosely annotated data along with innovative domain-specific tools can indeed be harnessed to establish intra-domain as well as cross-domain relationships and ultimately to understand the multimedia entities. It is an object of this disclosure to provide a unified framework for this purpose as well as to present a method and system to achieve this goal.