Many search engine services, such as Google and Overture, support searching for information that is accessible via the Internet. These services allow users to search for display pages, such as web pages, that may be of interest to them. After a user submits a search request (also referred to as a “query”) that includes search terms, the search engine service identifies web pages that may be related to those search terms. To quickly identify related web pages, a search engine service may maintain a mapping of keywords to web pages. The search engine service may generate this mapping by “crawling” the web (i.e., the World Wide Web) to extract the keywords of each web page. To crawl the web, a search engine service may use a list of root web pages and identify all web pages that are accessible through those root web pages. The keywords of any particular web page can be extracted using various well-known information retrieval techniques, such as identifying the words of a headline, the words supplied in the metadata of the web page, the words that are highlighted, and so on. The search engine service may calculate a relevance score that indicates how relevant each web page is to the search request based on the closeness of each match, web page popularity (e.g., Google's PageRank), and so on. The search engine service then displays to the user the links to those web pages in an order that is based on their relevance. More generally, search engines may provide searching of any collection of documents. For example, such a collection could include all U.S. patents, all federal court opinions, all archived documents of a company, and so on.
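The keyword-to-page mapping and relevance ordering described above can be illustrated with a minimal sketch. It assumes that the keywords of a page are simply its whitespace-separated words and that the relevance score is the number of matching search terms; the function names, the example URLs, and the scoring rule are hypothetical and are not drawn from any particular search engine service.

```python
from collections import defaultdict

def build_keyword_index(pages):
    """Map each keyword to the set of page URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        # Assumption: keywords are the lowercased, whitespace-separated words.
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Order pages by how many search terms they match (a stand-in score)."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for url in index.get(term, ()):
            scores[url] += 1
    # Highest score first; ties are broken by URL to keep the order stable.
    return sorted(scores, key=lambda url: (-scores[url], url))

# Hypothetical pages standing in for the result of a crawl.
pages = {
    "a.example/patents": "patent search for federal patents",
    "b.example/courts": "federal court opinions",
    "c.example/news": "company news archive",
}
index = build_keyword_index(pages)
results = search(index, "federal patent")
print(results)
```

In this sketch the first page matches both search terms and the second matches only one, so the first is listed ahead of the second, and the third page is not returned at all. An actual service would typically combine such a match score with other signals, such as page popularity.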
Search engine services may need to measure the similarity between various objects such as web pages or queries. For example, a search engine service may allow for interactive query expansion, which requires a similarity calculation between query terms and other terms. As another example, a search engine service may want to group web pages into clusters of similar web pages to assist a user in navigating through the web pages. Typical algorithms for determining the similarity of objects generate a feature vector for each object and then calculate the distance between the feature vectors as an indication of similarity. For example, web pages may have features that include keywords, content, and so on that are used to calculate the similarity. Most algorithms rely solely on the features associated with the objects themselves when determining similarity. For example, the similarity between web pages may be based solely on the content of the web pages. A few algorithms, however, factor in features that are based on heterogeneous objects. For example, one algorithm uses click-through data in which queries are similar if they contain the same terms or lead to the selection of the same web page by users. Thus, the feature vector for such a query would include information on the web pages of the query result that were selected by users.
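The feature-vector approach described above, including the click-through variant, can be sketched as follows. The sketch assumes cosine similarity as the comparison between feature vectors and assumes that each query term and each selected web page contributes one feature; the names `cosine_similarity` and `query_features`, the feature prefixes, and the example data are hypothetical choices made only for illustration.

```python
import math
from collections import Counter

def cosine_similarity(u, v):
    """Cosine similarity between two sparse feature vectors (dict-like)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)

def query_features(query, clicked_pages):
    """Build a feature vector from query terms and click-through data."""
    features = Counter("term:" + t for t in query.lower().split())
    # Heterogeneous features: the web pages users selected for this query.
    features.update("click:" + p for p in clicked_pages)
    return features

# Hypothetical queries and the pages users selected from their results.
q1 = query_features("cheap flights", ["travel.example"])
q2 = query_features("low cost airfare", ["travel.example"])
q3 = query_features("cheap laptops", ["shop.example"])

print(cosine_similarity(q1, q2))  # share no terms, but share a clicked page
print(cosine_similarity(q1, q3))  # share a term, but no clicked page
print(cosine_similarity(q2, q3))  # share nothing
```

Here the first two queries have no terms in common yet receive a nonzero similarity because users selected the same web page for both. Note that a clicked page contributes only when it is exactly the same page in both vectors.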
When calculating the similarity between objects of one type, however, these techniques fail to take into consideration the similarity among related objects of another type. That is, the similarity measurements for objects of one type may be related to the similarity measurements for objects of another type. For example, a query may be similar to another query based, in part, on the similarity between the web pages of the results that users select or click through. Conversely, a web page may be similar to another web page based, in part, on the similarity between the queries that return those web pages in their results. It would be desirable to have a technique for measuring the similarity of objects that factors in relationships between heterogeneous objects.