1. Technical Field
A “Name Disambiguator” provides various techniques for implementing an interactive framework for resolving or disambiguating entity names for entity searches where two or more same or similar names may refer to different entities.
2. Background Art
Entity searches (e.g., names of specific people, places, businesses, etc.) are becoming more and more common on the Internet as increasing numbers of people around the world search for specific entities and information relating to those entities. Unfortunately, name ambiguity in both publications and web pages is a problem that affects the quality of entity searches.
In general, two types of name ambiguity are considered. The first type arises when the same name string refers to different entities in the real world, since many people share the same name. For example, “Lei Zhang” can refer to a researcher from Microsoft® Research Asia, or to a different person from IBM® Research having the exact same name. The second type arises when different name strings refer to the same person, because of abbreviations, pseudonyms, the use or omission of middle names or initials, etc. For example, “Michael I. Jordan” also appears as “Michael Jordan” in many web pages or publications, and both refer to a professor at UC Berkeley. This particular name ambiguity problem is further complicated by the fact that “Michael Jordan” also refers to a famous basketball player (i.e., the first type of name ambiguity noted above).
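The two types of ambiguity described above can be illustrated with a small Python sketch. The mention list, the entity identifiers (e.g., `researcher_msra`), and the function name `ambiguity_report` are hypothetical labels chosen for illustration only; in a real search setting the entity behind each mention is unknown, which is precisely the problem to be solved.

```python
from collections import defaultdict

# Hypothetical (name string, entity id) pairs based on the examples above.
# The entity ids are illustrative labels, not identifiers from any real system.
MENTIONS = [
    ("Lei Zhang", "researcher_msra"),
    ("Lei Zhang", "researcher_ibm"),
    ("Michael I. Jordan", "professor_berkeley"),
    ("Michael Jordan", "professor_berkeley"),
    ("Michael Jordan", "basketball_player"),
]


def ambiguity_report(mentions):
    """Return (type1, type2) given ground-truth (name, entity) pairs.

    type1: name strings that refer to more than one entity.
    type2: entities that are referred to by more than one name string.
    """
    entities_by_name = defaultdict(set)
    names_by_entity = defaultdict(set)
    for name, entity in mentions:
        entities_by_name[name].add(entity)
        names_by_entity[entity].add(name)
    type1 = {n: sorted(e) for n, e in entities_by_name.items() if len(e) > 1}
    type2 = {e: sorted(n) for e, n in names_by_entity.items() if len(n) > 1}
    return type1, type2


type1, type2 = ambiguity_report(MENTIONS)
```

Here “Michael Jordan” shows up under the first type (one string, two entities), while the Berkeley professor shows up under the second type (one entity, two strings), matching the compounded case noted above.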
While a number of conventional schemes have been implemented in an attempt to address the disambiguation problem, there has been only limited success in this field. In fact, it has been observed that no known digital library of significant scope can provide a completely correct publication list for every researcher. For example, many publication lists contain papers of multiple researchers who have the same or similar names. Name ambiguities have an even worse effect on searching generic web pages. For example, when a web search for “Lei Zhang” is performed on a typical search engine, that search engine will typically return a very large number of web pages that refer to hundreds of different persons. Consequently, the user is left struggling to think up additional keywords to refine the results, which are usually still not satisfactory.
Examples of fully automated conventional models that have been used in various attempts to solve the disambiguation problem include Bayesian networks, support vector machines (SVM), affinity propagation, Markov Random Fields (MRF), etc. Unfortunately, no known fully automated model can achieve near 100% accuracy in every case because the variations of names are too complicated. Consequently, previous work has shown that a single fully automated model fails to leverage all aspects and address all cases well enough to provide name disambiguation at or near 100% accuracy.
More specifically, various attempts have been made to solve the name disambiguation problem for specific areas of interest, such as web names, authors of citations, names in email, etc. Most conventional schemes formalize the name disambiguation task as a clustering problem that is solved with fully automatic models. For example, one such technique for author name disambiguation clusters documents into atomic groups in a first step and then merges the groups. It was observed that the use of atomic groups improved the performance of existing clustering-based methods. Another such technique uses a similar two-stage clustering, where the first stage uses “strong features” such as compound key words and entity names to cluster web pages. These results are then further clustered in the second stage using “weak features” such as publication topics. Unfortunately, both of these two-stage schemes use automatic models that do not control the quality of the results in the first stage, thereby degrading the quality of the final results.
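The general shape of such a two-stage scheme can be sketched as follows. This is a minimal illustration only, not the method of any particular prior-art system: the document layout (`"strong"` and `"weak"` feature sets), the Jaccard similarity measure, the `weak_threshold` parameter, and all function names are assumptions introduced here.

```python
from itertools import combinations


def find(parent, i):
    """Return the root of i in a union-find forest (with path halving)."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def union(parent, a, b):
    """Merge the sets containing a and b, keeping the smaller index as root."""
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[max(ra, rb)] = min(ra, rb)


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def two_stage_cluster(docs, weak_threshold=0.5):
    """Cluster documents that mention the same name string.

    docs: list of dicts with hypothetical keys "strong" and "weak",
          each holding a set of feature strings.
    Returns a sorted list of clusters, each a list of document indices.
    """
    n = len(docs)
    parent = list(range(n))

    # Stage 1: form atomic groups from documents sharing any strong feature.
    for i, j in combinations(range(n), 2):
        if docs[i]["strong"] & docs[j]["strong"]:
            union(parent, i, j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(parent, i), set()).add(i)

    # Stage 2: merge atomic groups whose pooled weak features are similar.
    roots = sorted(groups)
    weak = {r: set().union(*(docs[i]["weak"] for i in groups[r])) for r in roots}
    for a, b in combinations(roots, 2):
        if jaccard(weak[a], weak[b]) >= weak_threshold:
            union(parent, a, b)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(parent, i), []).append(i)
    return sorted(clusters.values())


# Illustrative input: four documents mentioning the same name string.
DOCS = [
    {"strong": {"coauthor:A"}, "weak": {"vision", "retrieval"}},
    {"strong": {"coauthor:A"}, "weak": {"vision"}},
    {"strong": {"coauthor:B"}, "weak": {"vision", "retrieval"}},
    {"strong": {"coauthor:C"}, "weak": {"databases"}},
]
```

Note that nothing in the second stage can split an atomic group: if the first stage wrongly merges two documents, that error is carried into the final clusters, which is the weakness of these schemes noted above.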
In fact, a comparative study of many existing two-stage clustering methods was conducted that primarily compared different distance measures with various conventional supervised and unsupervised clustering methods. One method evaluated by the study applied two supervised models, a naive Bayes model and support vector machines, to solve the disambiguation problem. Another studied method used two unsupervised frameworks for solving the disambiguation problem, where one framework was based on the link structure of Web pages and the second framework used agglomerative/conglomerative double clustering. Unfortunately, as noted above, such schemes use automatic models that do not control the quality of the results in the first stage, thereby degrading the quality of the final results.
Several conventional schemes have also focused on using external data in an attempt to solve or improve the name disambiguation problem. For example, one such scheme made use of Wikipedia® pages associated with particular authors or topics to disambiguate named entities. This scheme extracted “features” from Wikipedia® for use in a supervised learning process. Unfortunately, since not every author entity is covered by a Wikipedia® page or other Internet source, such schemes cannot guarantee accuracy for disambiguating the names of all authors or other entities.