Searching for information about entities (e.g., people, locations, organizations) in a large number of documents, including documents from sources such as a network, is often ambiguous. This ambiguity may lead to imprecise text processing functions, imprecise association of features during knowledge extraction, and, thus, imprecise data analysis.
State-of-the-art systems use linkage-based clustering and ranking in algorithms such as PageRank and the hyperlink-induced topic search (HITS) algorithm. The basic idea behind these and related approaches is that pre-existing links typically exist between related pages or concepts. A limitation of clustering-based techniques is that the contextual information needed to disambiguate entities is sometimes missing, leading to incorrectly disambiguated results. Similarly, documents about different entities in the same or superficially similar contexts may be incorrectly clustered together.
Other systems attempt to disambiguate entities by reference to one or more external dictionaries (or knowledge bases) of entities. In such systems, an entity's context is compared to possible matching entities in the dictionary and the closest match is returned. A limitation associated with current dictionary-based techniques stems from the fact that the number of entities grows continuously and, therefore, no dictionary may include a representation of all of the world's entities. Thus, if a document's context is matched to an entity in the dictionary, then the technique has identified only the most similar entity in the dictionary, and not necessarily the correct entity, which may be outside the dictionary.
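By way of illustration only, the dictionary-based matching described above, and its limitation, may be sketched as follows. The sketch assumes a bag-of-words representation and cosine similarity as the closeness measure; neither is prescribed by the systems discussed. The function names, the two-entry dictionary, and the sample context are all hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphanumeric word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def closest_entry(context, dictionary):
    """Return (entity_id, score) for the dictionary entry most similar to the context."""
    ctx = bag_of_words(context)
    scored = ((eid, cosine(ctx, bag_of_words(desc))) for eid, desc in dictionary.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical dictionary: two distinct entities that share the name "John Smith".
DICTIONARY = {
    "john_smith_explorer": "John Smith English explorer colonial Virginia Jamestown",
    "john_smith_economist": "John Smith economist professor monetary policy inflation",
}

# Hypothetical context about a third John Smith who is NOT in the dictionary.
context = "John Smith scored twice for the football club after returning from Virginia"

best_id, score = closest_entry(context, DICTIONARY)
```

In this sketch the matcher always returns some dictionary entry. Here it selects the explorer entry because of the shared words "John", "Smith", and "Virginia", even though the context concerns an entity that the dictionary does not contain.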
Traditional search engines allow users to find only pieces of information that are relevant to an entity. Although millions or billions of documents may describe that entity, the documents are generally not linked together. In most cases, it may not be viable to try to discover a complete set of documents about a particular feature. Additionally, methods that pre-link data are limited to a single method of linking and are fed by many entity extraction methods that are ambiguous and inaccurate. These systems may not be able to use live feeds of data or to perform these processes on the fly. As a consequence, the latest information is not used in the linking process.
One limitation of fully automated linking is that the results provided by this type of system are only as good as the data coming into it. Therefore, if inaccurate data is provided to the system, inaccurate results may be provided.
Therefore, there is still a need for accurate entity disambiguation techniques that allow precise data analysis.