1. Field of the Invention
The present invention relates to computerized documents, and more particularly to a method for coreferencing two or more names which refer to a same entity.
2. Description of the Related Art
The need to identify and extract important concepts in on-line text documents is commonly acknowledged by researchers and practitioners in the fields of information retrieval, knowledge management and digital libraries. It is a necessary first step towards achieving a reduction in the ever-increasing volumes of on-line text.
There are several challenging aspects to the identification of names: identifying the text strings (words or phrases) that express names; relating names to the entities discussed in the document; and relating named entities across documents. In relating names to entities, the main difficulty is the many-to-many mapping between them. A single entity can be referred to by several name variants: FORD MOTOR COMPANY, FORD MOTOR CO., or simply FORD. A single variant often names several entities: Ford refers to the car company, but also to a place (Ford, Michigan) as well as to several people: President Gerald Ford, Senator Wendell Ford, and others. Context is crucial in identifying the intended mapping. A document usually defines a single context, in which it is quite unlikely to find several entities corresponding to the same variant. For example, if the document talks about the car company, it is unlikely to also discuss Gerald Ford. Thus, within documents, the problem is usually reduced to a many-to-one mapping between several variants and a single entity. In the few cases where multiple entities in the document may potentially share a name variant, the problem is addressed by careful editors, who refrain from using ambiguous variants. If Henry Ford, for example, is mentioned in the context of the car company, he will most likely be referred to by the unambiguous Mr. Ford.
Much recent work has been devoted to the identification of names within documents and to linking names to entities within the document. Several research groups, as well as a few commercial software packages, have developed name identification technology. However, few have investigated named entities across documents. In a collection of documents, there are multiple contexts; variants may or may not refer to the same entity; and ambiguity is a much greater problem. Cross-document coreference was briefly considered as a task for the Sixth Message Understanding Conference but then discarded as being too difficult (see Tipster Text Program, Sixth Message Understanding Conference (MUC-6)).
Recently, Bagga and Baldwin, in "Entity-based cross-document coreferencing using the vector space model," Proceedings of COLING-ACL 1998, pages 79-85, proposed a method for determining whether two names (mostly of people) or events refer to the same entity by measuring the similarity between the document contexts in which they appear. The approach of Bagga and Baldwin is to compare every two names which share a substring in common, for example, "President Clinton" and "Clinton, Ohio," to determine whether they refer to the same entity. This approach suffers from a potentially n-squared number of comparisons, which is a very costly process and cannot scale to process the size of current, and most certainly future, document collections. In addition, Bagga and Baldwin's approach does not address another cross-document problem of names that are potentially combinations of two or more names, which should be separated into their components, such as "President Clinton of the United States."
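The cost of such pairwise comparison can be illustrated with a minimal sketch. It is not Bagga and Baldwin's implementation: the bag-of-words context vectors, the shared-token test standing in for "share a substring in common," and the names `cosine`, `name_tokens` and `pairwise_comparisons` are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words context vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def name_tokens(name: str) -> set:
    """Lower-cased tokens of a name, with commas ignored (assumed proxy for a shared substring)."""
    return set(name.lower().replace(",", " ").split())


def pairwise_comparisons(contexts: dict) -> list:
    """Compare the contexts of every pair of names sharing a token.

    For n names this examines up to n * (n - 1) / 2 pairs, which is the
    quadratic growth the text describes.
    """
    vectors = {name: Counter(text.lower().split()) for name, text in contexts.items()}
    results = []
    for a, b in combinations(sorted(contexts), 2):
        if name_tokens(a) & name_tokens(b):
            results.append((a, b, cosine(vectors[a], vectors[b])))
    return results
```

With three names that all share the token "clinton", three context comparisons are performed; with n such names the count grows as n(n-1)/2.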
Therefore, a need exists for a coreferencing system and method which can be employed across a plurality of documents.
A method for coreferencing a plurality of documents, which may be implemented by a program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps, includes the steps of: providing a name list of names extracted from documents to be coreferenced, prior to or upon entry of a query by a user; sorting the names of the name list into mergable names and exclusive sets; comparing contexts of the mergable names against the exclusive sets, and merging the mergable names into the exclusive sets where the match exceeds a predetermined threshold, to form an aggregated cross-document name list; and referencing the aggregated cross-document name list to provide the user with coreferenced names across the plurality of documents which refer to a same entity in accordance with the query.
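The comparing and merging steps can be sketched as follows. This is an illustrative sketch under stated assumptions, not the claimed implementation: the mergable names and exclusive sets are taken as already sorted inputs, contexts are modeled as plain word-count vectors compared by cosine similarity, each mergable name is merged only into its single best-scoring set, and the function names and the default threshold of 0.5 are hypothetical.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words context vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_names(exclusive_sets: dict, mergable: dict, threshold: float = 0.5) -> dict:
    """Merge each mergable name into the exclusive set with the most similar context.

    exclusive_sets: {canonical name: context text}  (assumed representation)
    mergable:       {name variant: context text}    (assumed representation)
    Returns {canonical name: [canonical name, merged variants...]}.
    A variant whose best score does not exceed the threshold is left unmerged.
    """
    aggregated = {canon: [canon] for canon in exclusive_sets}
    set_vectors = {c: Counter(t.lower().split()) for c, t in exclusive_sets.items()}
    for name, text in mergable.items():
        vector = Counter(text.lower().split())
        best, best_score = None, threshold
        for canon, set_vector in set_vectors.items():
            score = cosine(vector, set_vector)
            if score > best_score:
                best, best_score = canon, score
        if best is not None:
            aggregated[best].append(name)
    return aggregated
```

Because each mergable name is compared only against the exclusive sets, rather than against every other name, the number of context comparisons is the product of the two list sizes instead of the square of the full name list.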
A method for searching a plurality of documents for an entity having a plurality of variant names, which may be implemented by a program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps, includes the steps of: providing a name list of names extracted from documents to be coreferenced, prior to or upon entry by a user of a search query including a name of the entity; sorting the names of the name list into mergable names and exclusive sets; comparing contexts of the mergable names against the exclusive sets, and merging the mergable names into the exclusive sets where the match exceeds a predetermined threshold, to form an aggregated cross-document name list, the aggregated cross-document name list including a list of variant names for the entity; and providing to the user a list of documents referencing the variant names and the name of the entity used for the search query.
In alternate methods, which may be implemented by the program storage device, the step of extracting the name list from a collection of documents by employing a name extractor may be included. The step of normalizing the name list to provide the names in the name list in a predetermined format may also be included. The step of splitting the names of the name list into component names based on evidence derived from the names may also be included. The evidence derived from the names may include one of prefixes, suffixes, titles and information indicating one of a place, an organization and a person. The step of merging identical names to reduce the name list may be included. The step of sorting the name list into a person list and a place list may also be included. The step of comparing contexts of the mergable names against the exclusive sets may include the step of mapping all of the mergable names to each of the exclusive sets to provide matches above the predetermined threshold.
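The normalizing, splitting, identical-name merging and person/place sorting steps might be sketched as below. Everything specific here is an assumption for illustration: the upper-case predetermined format, the tiny `TITLES` and `PLACES` evidence lists, the "TITLE ... OF [THE] PLACE" splitting pattern, and the extra `other` list for names that are neither persons nor places.

```python
import re

# Hypothetical evidence lists; a real system would use far richer resources.
TITLES = {"PRESIDENT", "SENATOR", "MR.", "MRS.", "DR."}
PLACES = {"UNITED STATES", "OHIO", "MICHIGAN"}


def normalize(name: str) -> str:
    """Put a name in one predetermined format: single spaces, upper case."""
    return re.sub(r"\s+", " ", name).strip().upper()


def split_name(name: str) -> list:
    """Split 'TITLE X OF [THE] PLACE' into a person part and a place part.

    The split happens only when both kinds of evidence agree: the left part
    starts with a known title and the right part is a known place.
    """
    m = re.match(r"^(?P<left>.+?) OF (?:THE )?(?P<right>.+)$", name)
    if m and m["left"].split()[0] in TITLES and m["right"] in PLACES:
        return [m["left"], m["right"]]
    return [name]


def build_lists(raw_names: list) -> tuple:
    """Normalize, split, drop identical names, and sort into person/place/other lists."""
    seen = set()
    persons, places, other = [], [], []
    for raw in raw_names:
        for part in split_name(normalize(raw)):
            if not part or part in seen:
                continue  # skip empty strings and identical names already listed
            seen.add(part)
            if part in PLACES:
                places.append(part)
            elif part.split()[0] in TITLES:
                persons.append(part)
            else:
                other.append(part)
    return persons, places, other
```

Under these assumptions, "President Clinton of the United States" would contribute one person name and one place name, and a later occurrence of "president clinton" would be recognized as identical and dropped.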
These and other objects, features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.