Large databases of image-rich web pages and their associated text are now common. Typically, an image is associated with the text that surrounds it on a web page or with the text of tags that users have attached to it. Models of the associations between images and text can be built from such databases and then used to search the images.
For example, a database contains information that image I is associated with text string T. A model is built from the database that contains this information. When a user queries for images with a query string T, the model determines that image I is associated with text string T and provides image I as the query result to the user. If there are other images that are associated with text string T, these associated images may also be provided to the user as query results.
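The lookup described above can be illustrated with a minimal sketch. It assumes the database is simply a list of (image, text) pairs; the class name `AssociationModel` and its `query` method are hypothetical names chosen for illustration and do not come from any particular system.

```python
from collections import defaultdict


class AssociationModel:
    """Toy model: maps each text string to the images associated with it."""

    def __init__(self, records):
        # records: iterable of (image_id, text) pairs standing in for the database
        self._images_by_text = defaultdict(list)
        for image_id, text in records:
            if image_id not in self._images_by_text[text]:
                self._images_by_text[text].append(image_id)

    def query(self, text):
        # Return every image associated with the query string (possibly none).
        return list(self._images_by_text.get(text, []))


# Image "I" and image "J" are both associated with text string "T".
model = AssociationModel([("I", "T"), ("J", "T"), ("K", "U")])
results = model.query("T")
```

Here a query for "T" returns image "I" together with the other image associated with the same string, and a query for an unknown string returns an empty list.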
The modeling of associations between images and their associated text, also known as keywords, has two components. The first component is image representation, in which each image is represented as a collection of “visual words”. A visual word is a description of a feature or characteristic of a particular image. For example, an image of a living room that contains a lamp and a coffee table can be represented as a set of visual words that includes two subsets: one subset that corresponds to the lamp and another subset that corresponds to the coffee table.
There are multiple ways of representing an image as a collection of visual words. One way is to represent the image as “blobs”, where each “blob” is described by color and texture feature vectors. Representing images as “blobs” is described in detail in K. Barnard, et al., “Matching Words and Pictures,” Journal of Machine Learning Research, 2003. Another way is to represent the image as a collection of “salient points”, as described by A. Bosch, et al., “Scene Classification via pLSA,” European Conference on Computer Vision, 2006. Salient points can be detected using several techniques, some of which are described in C. Schmid, et al., “Evaluation of Interest Point Detectors,” International Journal of Computer Vision, 2000. Once detected, a salient point can be represented as a SIFT (Scale Invariant Feature Transform) vector. This representation of salient points using SIFT is described in further detail in D. G. Lowe, “Distinctive image features from scale-invariant keypoints”.
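One common way to turn feature vectors into a collection of visual words is to assign each vector to its nearest entry in a fixed codebook. The sketch below illustrates that idea only; the text above does not specify a quantization step, so the codebook, the function names, and the 2-D vectors (standing in for much higher-dimensional descriptors such as SIFT vectors) are all assumptions made for illustration.

```python
import math
from collections import Counter


def nearest_word(descriptor, codebook):
    """Return the index of the codebook vector closest to the descriptor."""
    return min(range(len(codebook)),
               key=lambda i: math.dist(descriptor, codebook[i]))


def to_visual_words(descriptors, codebook):
    """Represent an image as a bag (multiset) of visual-word indices."""
    return Counter(nearest_word(d, codebook) for d in descriptors)


# Hypothetical codebook of three visual words and four descriptors from one image.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
descriptors = [(0.1, -0.1), (0.9, 1.2), (4.8, 5.1), (5.2, 4.9)]
bag = to_visual_words(descriptors, codebook)
```

In this toy image, visual word 2 occurs twice and visual words 0 and 1 occur once each.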
The second component of modeling image and keyword associations is the building of a statistical model. Statistical models that use hidden, or latent, variables have been used to model the statistical relationships between the collection of visual words that represents an image and the collection of keywords associated with that image. Several statistical models using latent variables have been developed, including: PLSA (Probabilistic Latent Semantic Analysis), as described in T. Hofmann, “Probabilistic Latent Semantic Analysis,” Proceedings of Uncertainty in Artificial Intelligence, UAI'99, 1999; Latent Dirichlet Allocation, as described in D. Blei, et al., “Latent Dirichlet Allocation,” NIPS, 2002; and Correspondence LDA, as described in D. Blei and M. Jordan, “Modeling Annotated Data,” ACM SIGIR Conference, 2003.
However, statistical models that use latent variables are limited because they take a bottleneck approach: the models further reduce image representations to a small number of latent variables, and statistical associations are made between keywords and those latent variables. These models also suffer from the drawback that estimating the latent variables is often very complex.
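As an illustration of the bottleneck, a PLSA-style aspect model relates a keyword to an image only through the latent variables. The notation is assumed here for illustration: $w$ is a keyword, $I$ is an image, and $z_1, \dots, z_K$ are the latent variables, with $K$ typically much smaller than the number of visual words.

```latex
P(w \mid I) = \sum_{k=1}^{K} P(w \mid z_k)\, P(z_k \mid I)
```

Every association between $w$ and $I$ must pass through one of the $K$ latent variables, and the distributions $P(w \mid z_k)$ and $P(z_k \mid I)$ are not observed directly, so they have to be estimated iteratively.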
Furthermore, current statistical models primarily provide unidirectional associations from images to keywords or from keywords to images, limiting the derivation of implicit associations among images and words.
Therefore, there is a need for a way to jointly model image-keyword associations inclusively to allow free and unlimited associations between images and keywords. Furthermore, the model should provide bidirectional associations between images and text.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.