Searching for information about entities (e.g., people, locations, organizations) in a large document collection, including sources such as a network, often yields ambiguous results, which may lead to imprecise text processing functions, imprecise association of features during knowledge extraction, and, thus, imprecise data analysis.
State-of-the-art systems use linkage-based clustering and ranking, as in algorithms such as PageRank and the hyperlink-induced topic search (HITS) algorithm. The basic idea behind these and related approaches is that pre-existing links typically exist between related pages or concepts. A limitation of clustering-based techniques is that the contextual information needed to disambiguate entities is sometimes absent from the available context, leading to incorrectly disambiguated results. Similarly, documents about different entities in the same or superficially similar contexts may be incorrectly clustered together.
Other systems attempt to disambiguate entities by reference to one or more external dictionaries (or knowledge bases) of entities. In such systems, an entity's context is compared to possible matching entities in the dictionary and the closest match is returned. A limitation associated with current dictionary-based techniques stems from the fact that new entities may appear at any time and, therefore, no dictionary can include a representation of all of the world's entities. Thus, if a document's context is matched to an entity in the dictionary, then the technique has identified only the most similar entity in the dictionary, and not necessarily the correct entity, which may be outside the dictionary.
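By way of illustration only, the dictionary-based matching described above may be sketched as follows. The sketch assumes a bag-of-words representation and cosine similarity as the closeness measure, neither of which is prescribed by any particular prior system; the toy dictionary, its entries, and the function names (bag_of_words, cosine, disambiguate) are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy dictionary: entity name -> short descriptive context.
DICTIONARY = {
    "Paris (city in France)": "capital city france seine eiffel tower",
    "Paris (mythological figure)": "trojan prince helen troy greek myth",
}

def bag_of_words(text):
    """Lower-case, whitespace-tokenized term counts (an assumed representation)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def disambiguate(context, dictionary=DICTIONARY):
    """Return the dictionary entity closest to the mention's context, with its score."""
    ctx = bag_of_words(context)
    scored = [(cosine(ctx, bag_of_words(desc)), name)
              for name, desc in dictionary.items()]
    score, name = max(scored)  # the closest match is always returned
    return name, score

# A mention whose true entity is in the dictionary.
print(disambiguate("the eiffel tower in paris france"))

# A mention of Paris, Texas, which is NOT in the dictionary: the sketch still
# returns the most similar entry, illustrating the limitation described above.
print(disambiguate("paris texas is a small city in lamar county"))
```

In this sketch the second mention is resolved to the French city, albeit with a low score, because the matcher can only choose among the entities it knows. A score threshold might reject some such cases, but selecting a suitable threshold is itself an assumption-laden design choice.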
Most methods use only entities and key phrases in the disambiguation process. Therefore, there is still a need for accurate entity disambiguation techniques that allow precise data analysis.