1. Technical Field
The invention relates to the analysis of information. More particularly, the invention relates to a method and apparatus for automatic entity disambiguation.
2. Description of the Prior Art
Spoken and written text consists of characters, words, names, sentences, documents, conversations, and so on, but the world that the text describes consists of distinct objects and events. People now have access to an enormous amount of text, but they are ultimately interested in the actual persons and organizations that interact in the real world. Entity disambiguation, sometimes also referred to as entity tracking, is the process of determining which names, words, or phrases in text correspond to distinct persons, organizations, locations, or other entities. This determination is essential for reasoning, inference, and the examination of social network structures based on information derived from text.
We use the term entity to mean an object or set of objects in the world. A mention is a reference to an entity, such as a word or phrase in a document. Entities may be referenced by their name, indicated by a common noun or noun phrase, or represented by a pronoun. Mentions may aggregate with other mentions that refer to the same specific real-world object, and, taken together, the aggregated mentions model an entity. These corpus-wide aggregated models of entities are of primary importance, while the individual mentions of an entity are of secondary importance (see Mitchell, A.; Strassel, S.; Przybocki, P.; Davis, J. K.; Doddington, G.; Grishman, R.; Meyers, A.; Brunstein, A.; Ferro, L. and Sundheim, B. 2004. Annotation Guidelines for Entity Detection and Tracking (EDT), Version 4.2.6. http://www.ldc.upenn.edu/Projects/ACE/).
Entity disambiguation inherently involves resolving many-to-many relationships. Multiple distinct strings, such as “Abdul Khan,” “Dr. Khan,” and “‘Abd al-Qadir Khan,” may refer to the same entity. Conversely, multiple identical mentions may refer to distinct entities. For example, tens of thousands of men share the name “Abdul Khan.”
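The many-to-many relationship described above can be illustrated with a minimal Python sketch. The entity identifiers `person_A` and `person_B` are hypothetical labels invented for this illustration and are not part of the described method; the sketch only shows that the same entity can carry several names while the same name can point to several entities.

```python
from collections import defaultdict

# Hypothetical annotated mentions: (surface string, entity identifier).
# The identifiers "person_A" and "person_B" are invented for illustration.
mentions = [
    ("Abdul Khan", "person_A"),
    ("Dr. Khan", "person_A"),
    ("‘Abd al-Qadir Khan", "person_A"),
    ("Abdul Khan", "person_B"),  # a different man who shares the name
]

names_by_entity = defaultdict(set)   # one entity -> many surface strings
entities_by_name = defaultdict(set)  # one surface string -> many entities

for surface, entity_id in mentions:
    names_by_entity[entity_id].add(surface)
    entities_by_name[surface].add(entity_id)

for entity_id in sorted(names_by_entity):
    print(entity_id, "is referred to as", sorted(names_by_entity[entity_id]))
for surface in sorted(entities_by_name):
    print(repr(surface), "may refer to", sorted(entities_by_name[surface]))
```

Because neither mapping is one-to-one, a lookup keyed on the surface string alone cannot recover the entity; additional context is needed.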
Consider the following sentences from a corpus of news text:
“Young pacer Yasir Arafat chipped in with two late wickets to finish with two for 36 in 18 overs.”
“General Musharraf also apprised Yasir Arafat of the outcome of his negotiations with the Indian Prime Minister Atal Behari Vajpayee at Agra.”
“Palestinians demonstrated in Gaza City Sunday in support of Palestinian leader Yasser Arafat.”
“Makkah has announced that the Arafat Gathering (9 Zul-Hajj) will be on Sat 31 Jan. 2004.”
These can be confusing even to a human reader. The “Yasir Arafat” of the first sentence is a Pakistani cricket player, the “Yasir Arafat” of the second sentence and the “Yasser Arafat” of the third are the late Palestinian leader, and the “Arafat” of the last sentence is a place near Mecca, Saudi Arabia. The job of the entity disambiguation system is to automatically assign these four mentions to three distinct entities, correctly grouping only the second mention of “Yasir Arafat” with “Yasser Arafat”.
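To make the target grouping concrete, the following Python sketch encodes the four mentions and compares the grouping described above with a naive baseline that groups mentions by exact surface string. The entity labels, the `group_by` helper, and the choice of “Arafat” as the surface string of the fourth mention are assumptions made for this illustration; the baseline is shown only to expose the problem and is not the technique of the invention.

```python
from collections import defaultdict

# (sentence number, surface string, entity label).
# The entity labels are hypothetical names for the three real-world entities.
mentions = [
    (1, "Yasir Arafat", "cricketer"),
    (2, "Yasir Arafat", "palestinian_leader"),
    (3, "Yasser Arafat", "palestinian_leader"),
    (4, "Arafat", "place_near_mecca"),
]

def group_by(mentions, key):
    """Group sentence numbers by key(surface, entity); return sorted groups."""
    groups = defaultdict(set)
    for sentence_no, surface, entity in mentions:
        groups[key(surface, entity)].add(sentence_no)
    return sorted(sorted(group) for group in groups.values())

# Grouping described in the text: four mentions, three entities.
target = group_by(mentions, lambda surface, entity: entity)

# Naive baseline: treat identical strings as the same entity.
naive = group_by(mentions, lambda surface, entity: surface)

print("target:", target)  # [[1], [2, 3], [4]]
print("naive: ", naive)   # [[1, 2], [3], [4]]
```

The baseline makes both kinds of error at once: it merges the cricket player with the Palestinian leader because their names are spelled identically, and it separates the two mentions of the leader because their names are spelled differently.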
It would be advantageous to provide an entity disambiguation technique that correctly resolves the above mentions in the context of the entire corpus.