The ability to identify named entities, such as people and locations, within a document has been established as an important task in areas such as topic detection and tracking, machine translation, and information retrieval. A user may search the Web or another resource for a particular person, place, or other specific entity by entering, as a query in a search engine, a string of text characters that constitutes an orthographic representation of a common name for the entity. However, this string of text may also refer to other, unrelated entities or meanings that are irrelevant to the intended search results, while many relevant references may use other variations on the orthographic representation of the entity and thereby be missed by a search engine given that particular string of text. For example, a search based on the string of text “George W. Bush” might return results that instead reference the earlier American president George H. W. Bush, George Bush Intercontinental Airport in Houston, the aircraft carrier George H. W. Bush, other famous people, places, or entities with “Bush” in their name, or a literal “bush” or shrub as a category of plant. A search may also miss alternate “surface forms”, that is, alternate orthographic references to the same intended entity, such as an abbreviated, alternate, casual, or other context-specific form of the name of the intended entity. For instance, alternate surface forms for George W. Bush that are used in various documents available on the Web may include “President Bush”, “Bush 43”, or even an orthographically unrelated term such as “Dubya”. Other documents available on the Web might refer to Ronald Reagan as “The Gipper” or to Abraham Lincoln as “Honest Abe”.
Any of these documents might contain valuable information that would be desirable to include in the results of a search for the respective entity, yet may be missed by a search for a string of text that represents only the standard surface form of that entity. It would therefore be highly desirable to identify text references to particular named entities consistently across all the various surface forms in which such references may occur.
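By way of a purely illustrative sketch, the two problems described above (relevant documents missed by an exact string match, and a single surface form pointing to several entities) may be shown with a small hand-built table. The table contents, the sample documents, and the names `SURFACE_FORMS`, `naive_search`, and `candidate_search` are all hypothetical and are assumed here only for illustration; they do not describe any particular search engine or the claimed subject matter.

```python
# Hypothetical illustration only: the table, documents, and function names
# are assumptions made for this sketch.

# Each surface form maps to the set of entities it may refer to.
SURFACE_FORMS = {
    "George W. Bush": {"George W. Bush"},
    "President Bush": {"George W. Bush", "George H. W. Bush"},
    "Bush 43": {"George W. Bush"},
    "Dubya": {"George W. Bush"},
    "Bush": {"George W. Bush", "George H. W. Bush", "Jeb Bush"},
}

DOCUMENTS = [
    "George W. Bush signed the bill.",
    "Dubya spoke in Texas.",
    "Bush 43 gave a speech.",
    "Bush delivered the commencement address at the university.",
    "The bush in the garden needs pruning.",
]


def naive_search(query, documents):
    """Return documents containing the exact query string."""
    return [doc for doc in documents if query in doc]


def surface_forms_for(entity):
    """Return every known surface form that may refer to the entity."""
    return sorted(form for form, ents in SURFACE_FORMS.items() if entity in ents)


def candidate_search(entity, documents):
    """Return documents containing any known surface form of the entity.

    Matching is case-sensitive, so the lowercase plant sense of "bush"
    is not matched by the surface form "Bush".
    """
    forms = surface_forms_for(entity)
    return [doc for doc in documents if any(form in doc for form in forms)]


if __name__ == "__main__":
    # The exact string finds a single document and misses the alternate forms.
    print(naive_search("George W. Bush", DOCUMENTS))
    # Expanding to all surface forms recovers the missed documents, but also
    # returns the commencement sentence, whose "Bush" remains ambiguous.
    print(candidate_search("George W. Bush", DOCUMENTS))
    print(sorted(SURFACE_FORMS["Bush"]))
```

In this sketch, expanding the query to every known surface form recovers documents that the exact string misses, but the short form “Bush” still maps to more than one entity, so a table of this kind improves coverage without resolving which entity a given mention refers to.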
While the intended entity for an ambiguous surface form may in some instances be quite clear to informed readers from context, for example “Bush delivered his State of the Union address to Congress”, many other instances are more ambiguous, e.g., “Bush delivered the commencement address at the university”. In the latter example, the surface form “Bush” may actually refer to former president George H. W. Bush or to former Florida governor Jeb Bush, for example, and readers unfamiliar with the event covered would not be able to resolve it correctly. In addition, because text content on the Internet is distributed around the world, many readers of any given content can be expected to come from backgrounds that do not equip them with the full context that a writer would take for granted. An effective way to provide explicit disambiguation of ambiguous surface forms for specific entities would therefore fulfill a broad and persistent need.
The discussion above is merely provided for general background information and is not intended to be used as an aid in determining the scope of the claimed subject matter.