When a user searches the web for information about a particular person, the search engine often retrieves data about many different persons who share the same name. To automatically disambiguate such persons, text clustering, which is directed towards finding groups within sets of data, can sometimes be an effective and practical technique. However, conventional text clustering mainly addresses topic clustering, not person clustering or disambiguation.
In fact, traditional text clustering methods have many shortcomings when applied to person disambiguation, chiefly that personal information is not well exploited, which results in a number of challenges. For example, useful information relevant to a particular person is often sparse, especially within the snippets retrieved by the search engine. Although better clues for distinctly identifying a person might include the person's organization, career, location, relationships with other persons, and so forth, such terms rarely occur more than once in a short text segment. As a result, the text clustering results are often biased by other factors (which can be considered noise). Further, the resulting cluster names are usually hard to interpret with respect to the general goals of person disambiguation.
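The sparsity problem can be illustrated with a minimal sketch. The two snippets, the name "John Smith", and the clue vocabulary `CLUE_TERMS` below are all hypothetical and chosen only for illustration; they are not taken from any real search result or from the approach described here. The sketch counts terms in each snippet and compares the counts of person-identifying clue terms with the most frequent term overall.

```python
from collections import Counter
import re

# Hypothetical search snippets for the made-up name "John Smith".
snippets = [
    "John Smith is a professor at Example University. "
    "Read more news, news updates and news alerts.",
    "John Smith, a chef in Springfield, shares recipes. "
    "Read more news, news updates and news alerts.",
]

# Hypothetical clue vocabulary (career, organization, location).
CLUE_TERMS = {"professor", "university", "chef", "springfield"}


def term_counts(text):
    """Lowercase the text, split on non-letters, and count each term."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


for snippet in snippets:
    counts = term_counts(snippet)
    # Counts of the terms that actually distinguish one person from another.
    clues = {term: n for term, n in counts.items() if term in CLUE_TERMS}
    # The single most frequent term in the snippet, whatever it is.
    top_term, top_count = counts.most_common(1)[0]
    print("clue counts:", clues, "| most frequent term:", top_term, top_count)
```

In this toy input each clue term occurs exactly once, while the boilerplate term "news" occurs three times in both snippets. A similarity measure driven by raw term frequency would therefore tend to group the two snippets together, even though the clue terms indicate two different persons.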
Moreover, some popular names are shared by a relatively large number of different individuals. A conventional iterative clustering algorithm, such as k-means, makes repeated passes over the data and thus unavoidably increases analysis time. Online usage would therefore require a fast, high-quality clustering algorithm.