A challenge in searching for information about people and other entities in large document sets, such as the Internet, is recognizing an entity and disambiguating it from others. Entities include, but are not limited to, people, organizations, locations and the like, and typically are represented in language by a proper noun. Often, a proper noun phrase is ambiguous and may represent several different entities. The entity most likely being represented is then determined from the context in which the phrase appears.
Most search engines, especially those generally available over the Internet, do not provide any disambiguation and simply return to the user a list of documents that contain the query terms. This kind of result requires the user to sort out which documents are relevant. For example, a search for “Michael Jordan” can provide results about a basketball player or a statistics professor. A search for “Michael Smith” can find documents related to any of thousands of people.
Some systems attempt to disambiguate entities by clustering document sets based on the context in which an entity appears. For example, in a set of documents containing the words “Michael Jordan,” all documents that contain similar basketball related words might be grouped together to represent one “Michael Jordan,” while all documents that contain words related to statistics and machine learning might be grouped together to represent another “Michael Jordan.”
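By way of illustration only, the clustering approach described above can be sketched as follows. This is a minimal sketch, not the method of any particular system: it assumes a bag-of-words context, Jaccard similarity, a greedy single-pass grouping, and a threshold of 0.2. The function names, the stopword list, and the sample documents are all hypothetical.

```python
import re

# Assumption: a tiny hand-picked stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "on", "for", "to", "by", "as"}

def context_words(text, entity):
    """Lowercased words of a document, minus stopwords and the entity name."""
    name_words = set(entity.lower().split())
    words = set(re.findall(r"[a-z]+", text.lower()))
    return words - name_words - STOPWORDS

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_mentions(docs, entity, threshold=0.2):
    """Greedy single-pass clustering: each document joins the first cluster
    whose pooled context words are similar enough, else starts a new cluster."""
    clusters = []  # each item: (pooled context words, list of document indices)
    for index, text in enumerate(docs):
        words = context_words(text, entity)
        for pooled, members in clusters:
            if jaccard(words, pooled) >= threshold:
                pooled |= words        # grow the cluster's vocabulary in place
                members.append(index)
                break
        else:
            clusters.append((set(words), [index]))
    return [members for _, members in clusters]

# Hypothetical documents that all mention the same proper noun phrase.
docs = [
    "Michael Jordan scored 40 points in the basketball game for the Bulls",
    "Michael Jordan published a paper on machine learning and statistics",
    "The Bulls won the basketball game as Michael Jordan scored again",
    "A statistics lecture on machine learning by Michael Jordan",
]
groups = cluster_mentions(docs, "Michael Jordan")
print(groups)
```

In this sketch the two basketball-related documents fall into one group and the two statistics-related documents into another, each group standing for one candidate “Michael Jordan.”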
Other systems attempt to disambiguate entities by reference to one or more external dictionaries of entities. In such systems, an entity's context is compared to possible matching entities in the dictionary and the closest match is returned. For example, documents about the business activities of Michael Jordan and documents about the basketball career of Michael Jordan could both be matched to the same Michael Jordan in the dictionary, even though the two sets of documents may not have many terms in common with each other.
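By way of illustration only, a dictionary-based match of the kind described above can be sketched as follows. The dictionary entries, their associated words, and the overlap-count scoring are assumptions made for this sketch and are not drawn from any actual entity dictionary.

```python
import re

# Hypothetical entity dictionary: each entry maps to words associated with it.
ENTITY_DICTIONARY = {
    "Michael Jordan (basketball player)": {
        "basketball", "bulls", "team", "business", "owner",
    },
    "Michael Jordan (statistics professor)": {
        "statistics", "machine", "learning", "professor", "research",
    },
}

def closest_entry(text, dictionary):
    """Return the dictionary entry whose associated words overlap most
    with the words of the document (assumed scoring: raw overlap count)."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return max(dictionary, key=lambda name: len(words & dictionary[name]))

# Two hypothetical documents with almost no terms in common with each other.
career_doc = "Michael Jordan led the Bulls to another basketball title"
business_doc = "Michael Jordan expanded his business as a team owner"

career_match = closest_entry(career_doc, ENTITY_DICTIONARY)
business_match = closest_entry(business_doc, ENTITY_DICTIONARY)
print(career_match)
print(business_match)
```

Both documents resolve to the same dictionary entry because each is compared to the entry rather than to the other document.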
In both clustering-based systems and dictionary-based systems, a variety of context-based information can be used to disambiguate entities in documents, such as: whether documents are on the same web site, other words in the documents, inferred relationships with other entities, document similarity metrics, and the like. For example, the relationship of an entity to other entities can serve to disambiguate one entity from another: if a document includes a reference to one person, e.g., “Michael Jordan,” and also refers to another entity, e.g., “Chicago Bulls” as his team, then a “Michael Jordan” in another document also referring to “Chicago Bulls” can be considered the same “Michael Jordan.”
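By way of illustration only, the relationship-based test in the preceding example can be sketched as follows. The sketch assumes that the related entities of each mention have already been extracted; the `Mention` type, the `same_entity` function, and the sample data are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Mention:
    """One occurrence of a proper noun phrase in a document, together with
    the other entities the document relates it to (assumed already extracted)."""
    name: str
    related: frozenset

def same_entity(a, b):
    """Assumed rule: two mentions refer to the same entity when the names
    match and the mentions share at least one related entity."""
    return a.name == b.name and bool(a.related & b.related)

first = Mention("Michael Jordan", frozenset({"Chicago Bulls"}))
second = Mention("Michael Jordan", frozenset({"Chicago Bulls", "basketball"}))
third = Mention("Michael Jordan", frozenset({"statistics", "machine learning"}))

print(same_entity(first, second))  # the two mentions share "Chicago Bulls"
print(same_entity(first, third))   # the two mentions share no related entity
```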
A problem associated with clustering-based techniques is that the information needed to disambiguate entities is sometimes not present in the context, leading to incorrectly disambiguated results. Documents about the same entity in different contexts may not be clustered together even though they refer to the same entity. For example, Michael Jordan the basketball player is also an active businessperson. Documents about his business activities might not be clustered together with documents about his basketball career, despite the fact that both clusters of documents refer to the same Michael Jordan. Similarly, documents about different entities in the same or superficially similar contexts may be incorrectly clustered together. For example, documents about the statistics professor Michael Jordan might be incorrectly clustered together with documents about the basketball statistics of Michael Jordan the basketball player.
A problem associated with current dictionary-based techniques stems from the fact that no dictionary can contain a complete representation of the world's entities. Thus, if a document's context is matched to an entity in the dictionary, then the technique has identified only the most similar entity in the dictionary, and not necessarily the correct entity, which may be outside the dictionary.
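By way of illustration only, the following sketch shows how a closest-match lookup returns an entry even when the correct entity is absent from the dictionary. The dictionary contents, the overlap-count scoring, and the third “Michael Jordan” (a school teacher) are hypothetical and chosen solely to expose the problem.

```python
# Hypothetical two-entry dictionary; any real dictionary is likewise incomplete.
dictionary = {
    "Michael Jordan (basketball player)": {"basketball", "bulls", "team", "game"},
    "Michael Jordan (statistics professor)": {"statistics", "machine", "learning", "professor"},
}

def best_match(context, entries):
    """Return the most similar entry and its score (assumed scoring: number
    of shared words). The function always returns some entry, even a poor one."""
    scores = {name: len(context & profile) for name, profile in entries.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Context words from a document about a different, hypothetical Michael Jordan
# who is not in the dictionary: a teacher who coaches a local school team.
context = {"teacher", "school", "coaches", "local", "team"}

match, score = best_match(context, dictionary)
print(match, score)
```

Here a single shared word is enough for the basketball player to be returned, although the document concerns neither entity in the dictionary.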