Coreference resolution is a recurring linguistic and technical problem for which past work has not produced a satisfactory solution. I use the term ‘coreference resolution’ for the task of identifying when a noun phrase is cohesively linked to a previously occurring item. The noun phrase that refers backward is the anaphor, and the previously occurring item is its antecedent.
Several applications stand to benefit from resolving anaphors. In information extraction, consider a system that recovers the perpetrators of terrorist acts. It would be much more valuable for this system to generate a list of people's names than the instances of he and she that are often the explicit mentions. In information retrieval, when a user performs a web search on George Bush, an advanced ranking algorithm would take into account the number of references to George Bush in each web page, including he, him, and the President. In text classification, a user might want news articles classified by what actions Chuck Yeager took, but unless he, him, and the pilot are resolved to Yeager, the classification algorithm is apt to miss valuable clues.
Note that none of these tasks is impossible without coreference resolution. On the contrary, all of them exist today in some useful form without the resolution of anaphors. What was becoming clear, however, was that this one linguistic phenomenon affects a wide range of natural language processing (“NLP”) tasks, and that developing a computational treatment of coreference could have broad implications. This was particularly true of textual documents such as newspaper articles and radio transcripts, because these documents contain a large number of anaphor types, including relative pronouns, reflexive pronouns, personal pronouns, and definite noun phrases. This last type, though, presented a unique challenge because definite noun phrases are not always anaphoric. For example, the country, the vehicle, and the organization are quite likely anaphors, but the United States, the UN Secretary General, and the CIA do not require a preceding antecedent to be understood. In the terrorism texts, a reader would be expected to recognize the MRTA and the FMLN (the names of two prevalent terrorist organizations) in the same way that an American reader would recognize the FBI. These nonanaphoric definite noun phrases seemed to be topic-specific and often grounded in real-world knowledge.
Earlier research efforts had demonstrated that some nonanaphoric definite noun phrases could be recognized by their surrounding syntactic context. For example, in the mayor of San Francisco, the attached prepositional phrase (of San Francisco) provides enough context for the reader to understand the referent of the mayor. Syntactic constraints, however, would not help with syntactically independent cases like the MRTA and the FMLN, which are common in the terrorism texts. So, while the existing approaches were useful, they left a large number of important cases untreated. Consequently, additional work was needed in the area.
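To make the limitation concrete, the following is a minimal sketch of a syntactic-context check. It is not the method used in the earlier research; it is a deliberately crude, token-based stand-in written for illustration. The function name `has_attached_pp` and the small preposition list are assumptions introduced here, and a real system would rely on a parser rather than on whitespace tokenization.

```python
# Hypothetical illustration: flag a definite noun phrase as likely
# nonanaphoric when a prepositional phrase is attached to its head.
# The preposition list is an assumption, not taken from prior work.
PREPOSITIONS = {"of", "in", "for", "from"}


def has_attached_pp(noun_phrase: str) -> bool:
    """Return True if the phrase looks like 'the <head> <prep> <object>'."""
    tokens = noun_phrase.split()
    # Need at least: "the", a head word, a preposition, and an object.
    if len(tokens) < 4 or tokens[0].lower() != "the":
        return False
    # The preposition must follow the head and precede at least one token.
    return any(tok.lower() in PREPOSITIONS for tok in tokens[2:-1])


if __name__ == "__main__":
    examples = [
        "the mayor of San Francisco",  # attached PP supplies the context
        "the country",                 # likely a true anaphor
        "the MRTA",                    # nonanaphoric, but no syntactic cue
        "the FMLN",                    # nonanaphoric, but no syntactic cue
    ]
    for phrase in examples:
        print(phrase, "->", has_attached_pp(phrase))
```

Under these assumptions the check accepts the mayor of San Francisco, but it cannot tell the MRTA or the FMLN apart from a genuinely anaphoric phrase such as the country, which is exactly the gap described above.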