Given a search query string, Web search engines have traditionally returned a list of hyperlinks that, upon selection, link to pages on the Web deemed relevant to the input search query. More recently, search engine results pages often also include richer content, usually via vertical information domains. As this trend continues, search will converge to a point where indexing and retrieval of information are performed not only with respect to Web pages but also with respect to abstract entities such as applications (for instance, from application marketplaces), movies, television shows, people, celebrities, events, cities, restaurants, theaters, companies, and the like. To surface entities, search engines must crawl multiple unstructured Web pages and/or subscribe to structured feeds regarding a particular entity type, resolve instances of an entity across this multi-source data, and surface a representation of the (merged) entity when a user's intent refers to the entity and/or its entity type. The complications associated with indexing and searching entities are compounded by the need to retrieve entities based on approximate descriptions; to retrieve broad sets of entities, some of which may not be described directly by the query string; to retrieve metadata on an entity from a popular source based on its description in an unpopular source; to combine the features and ranks of indexed entities across multiple sources; to perform faceted search over entities; and, in general, to perform integrated search by combining information from multiple Web pages into a composite whole.
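By way of illustration only, the resolution and merging of entity instances across multi-source data might be sketched as follows. This is a minimal sketch under stated assumptions, not a method described in this document: the matching key (normalized title plus release year), the record fields, the source names, and the function names `resolution_key` and `merge_entities` are all hypothetical.

```python
import re
from collections import defaultdict


def resolution_key(record):
    """Hypothetical matching key: normalized title plus release year."""
    title = re.sub(r"[^a-z0-9]+", " ", record["title"].lower()).strip()
    return (title, record.get("year"))


def merge_entities(records):
    """Group records sharing a key and union their attributes into one entity."""
    groups = defaultdict(list)
    for record in records:
        groups[resolution_key(record)].append(record)

    merged = []
    for group in groups.values():
        entity = {"sources": []}
        for record in group:
            entity["sources"].append(record["source"])
            for field, value in record.items():
                if field != "source":
                    # Assumption: the first source seen wins on conflicting fields.
                    entity.setdefault(field, value)
        merged.append(entity)
    return merged


# Hypothetical records: one from a structured feed, one from a crawled page.
records = [
    {"source": "feed_a", "title": "Titanic", "year": 1997,
     "cast": ["Leonardo DiCaprio"]},
    {"source": "page_b", "title": "TITANIC!", "year": 1997,
     "genres": ["Drama", "Romance"]},
    {"source": "feed_a", "title": "The Dark Knight", "year": 2008},
]
entities = merge_entities(records)
```

In this sketch the two "Titanic" records collapse into a single merged entity carrying both the cast from one source and the genres from the other. A practical system would likely need approximate matching rather than an exact key.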
Prior art solutions to the entity search problem fall into one of two approaches, each with its own disadvantages. First, vertical engine results pages (VERPs), which are specialized to a single information vertical, often search over collections of entities of a single type (e.g., movie entities) from an index containing basic entity attributes. Such solutions fail on queries that provide ambiguous descriptions or semantically relevant text that does not appear in the index (e.g., the query “movie with a sinking boat starring DiCaprio” may not return the movie “Titanic,” or the query “Batman” may not return the movie “The Dark Knight”). The second general approach uses Web search, which has the advantage of a large index of related terms that exploits Web link structure and anchor text, includes powerful intent analysis, and uses automatic spelling correction. A disadvantage of this approach is that rich content of the kind offered by a VERP may not be surfaced at all if indexed pages are not resolved to entities. Even if rich content is retrieved, numerous results linking to instances of the same basic entity may be retrieved together, reducing the diversity of results, because indexed pages are not resolved to one another.
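The first disadvantage can be made concrete with a small sketch of a vertical index that holds only basic entity attributes. This is an illustrative assumption rather than a description of any actual VERP: the conjunctive term matching, the stopword list, the indexed title and cast attributes, and the names `terms`, `INDEX`, and `vertical_search` are all hypothetical.

```python
import re

# Hypothetical stopword list; a real engine would use a fuller one.
STOPWORDS = {"a", "the", "with", "movie", "starring"}


def terms(text):
    """Lowercase, tokenize, and drop stopwords."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOPWORDS}


# Assumed index of basic attributes only (title and cast), keyed by title.
INDEX = {
    "Titanic": terms("Titanic Leonardo DiCaprio Kate Winslet"),
    "The Dark Knight": terms("The Dark Knight Christian Bale Heath Ledger"),
}


def vertical_search(query):
    """Return titles whose indexed attributes contain every query term."""
    query_terms = terms(query)
    return [title for title, indexed in INDEX.items()
            if query_terms and query_terms <= indexed]


exact = vertical_search("Titanic DiCaprio")
descriptive = vertical_search("movie with a sinking boat starring DiCaprio")
related = vertical_search("Batman")
```

Under these assumptions, only the query that repeats indexed attribute text finds its movie. The descriptive query fails because "sinking" and "boat" are not indexed attributes, and "Batman" fails because the character name appears in neither the title nor the cast of "The Dark Knight".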