The present invention relates to information retrieval systems generally, and more particularly to disambiguation of named entities within documents and queries.
Searches for named entities are among the most common searches on the Web. They include searches for persons, places (geographical locations as well as cities, states, countries, etc.), businesses and other organizations, products, books, movies, and so forth. Generally, a named entity is anything that has a proper noun (or noun phrase) or proper name associated with it. A search for a named entity typically returns a set of search results that have relevant information about any entity with the same name as the query, or even with a portion of that name. Thus, a query for “Long Beach” is likely to return documents about the coastal city on Long Island, N.Y., documents about the coastal city in Southern California, and documents that are merely relevant to the terms “long” and “beach”. Similarly, a query for “John Williams” will return documents about the composer, the wrestler, and the venture capitalist, all of whom share this name; a query for “Python” will return documents pertaining to the programming language, the snake, and the movie. The underlying problem, then, is that queries for named entities are typically ambiguous: they may refer to different instances of the same class (e.g., different people with the same name) or to things in different classes (e.g., a type of snake, a programming language, or a movie).
Search results for a named entity are typically ordered according to the frequency of the query terms, page rank, or other factors, without consideration of the different senses of the query (e.g., the different entities to which the name refers). As a result, the search results pertaining to the different entities tend to be mixed together. Further, even though the user is typically searching for a document (page) that best describes the named entity (or the different entities of the same name), the search results may not include such a document, or may not rank it very highly, again because the search system did not identify the different senses of the name.
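By way of illustration only, the mixing of senses described above can be reproduced with a minimal sketch. The toy corpus, the sense labels, and the function names `term_frequency` and `rank_by_term_frequency` are hypothetical and are not taken from any actual search system; the ranking uses raw query-term frequency alone, which is only one of the factors mentioned above.

```python
from collections import Counter

# Hypothetical toy corpus: (document id, entity the page is about, page text).
# The sense label is shown for inspection only; the ranker never looks at it.
DOCS = [
    ("d1", "programming language", "python python python tutorial for the python language"),
    ("d2", "snake", "the python is a large snake python python habitat"),
    ("d3", "programming language", "install python and run a python script"),
    ("d4", "movie", "python a 2000 horror movie about a giant python"),
    ("d5", "snake", "a python swallows its prey whole"),
]


def term_frequency(query, text):
    """Count how many times the query terms occur in the text."""
    counts = Counter(text.lower().split())
    return sum(counts[term] for term in query.lower().split())


def rank_by_term_frequency(query, docs):
    """Order matching documents by query-term frequency, ignoring sense."""
    scored = [
        (term_frequency(query, text), doc_id, sense)
        for doc_id, sense, text in docs
    ]
    matching = [entry for entry in scored if entry[0] > 0]
    # Highest frequency first; ties broken by document id for determinism.
    return sorted(matching, key=lambda entry: (-entry[0], entry[1]))


if __name__ == "__main__":
    for score, doc_id, sense in rank_by_term_frequency("Python", DOCS):
        print(doc_id, score, sense)
```

In this sketch the ranked list alternates between the programming language, the snake, and the movie, because the ordering depends only on how often the term occurs and not on which entity each document describes.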