The exemplary embodiment relates to processing of text and finds particular application in connection with named entity coreference resolution.
A named entity is the name of a unique entity, such as a person, place, or thing. Identifying references to named entities in text is useful in many information access and question-answering systems. Named entity coreference resolution refers to the process of identifying different references to the same named entity. The problem is not only to find the correct entity for pronouns or implicit mentions, but also to recognize that the same named entity can appear with different superficial forms, due to spelling variants, use of parts of the complete name, acronyms, and so forth. The issue arises not only within a particular document (intra-document coreference resolution), but also when considering a set of documents and/or queries (inter-document coreference resolution).
Natural language processing systems often include a named entity recognition component which applies a set of rules for identifying named entities in text. These systems tend to over-generate named entities. This means that several different entities are generated for the same real named entity, favoring precision over recall.
News-media aggregators often wish to create clusters of news articles talking about one event. It would be useful for them to be able to relate news articles to events through the actors (e.g., people and organizations) that took part in them. However, the over-generation of named entities, combined with the high frequency with which new named entities appear, makes this difficult.
Statistical methods for named entity coreference resolution traditionally compute a similarity between two references to an entity in the text, and then collect some of these references into an equivalence cluster. Clustering can be achieved, for example, with coreference chains, tree traversal, or graph-cut algorithms (see, Vincent Ng et al., “Improving machine learning approaches to coreference resolution,” Proc. Annual Meeting of Association for Computational Linguistics (ACL), pp. 104-111 (July 2002); Xiaoqiang Luo, et al., “A mention-synchronous coreference resolution algorithm based on the bell tree,” Proc. ACL Annual Meeting, article 135 (2004); and Cristina Nicolae, et al., “BestCut: A Graph Algorithm for Coreference Resolution,” Proc. Empirical Methods in Natural Language Processing, pp. 275-283 (July 2006)).
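By way of illustration only, the two-stage approach described above (pairwise similarity followed by grouping into equivalence clusters) can be sketched as follows. The sketch does not implement any of the cited algorithms: it assumes a simple token-overlap (Jaccard) similarity and a single-link grouping via union-find, and the names `token_jaccard`, `cluster_mentions`, and the threshold of 0.5 are hypothetical choices made for the example.

```python
def token_jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two mentions (illustrative assumption)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def cluster_mentions(mentions, similarity=token_jaccard, threshold=0.5):
    """Group mentions into equivalence clusters by single-link union-find.

    Two mentions end up in the same cluster if they are connected by a
    chain of pairs whose similarity is at or above the threshold.
    """
    parent = list(range(len(mentions)))

    def find(i):
        # Follow parent links to the root, halving the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(mentions)):
        for j in range(i + 1, len(mentions)):
            if similarity(mentions[i], mentions[j]) >= threshold:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i

    clusters = {}
    for i, mention in enumerate(mentions):
        clusters.setdefault(find(i), []).append(mention)
    return list(clusters.values())
```

Because the grouping is transitive, a single spurious high-similarity pair can merge two otherwise unrelated clusters, which is one reason more elaborate clustering schemes such as those cited above have been proposed.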
The similarity between two references can be computed using a variety of features, which can be grouped into lexical (string-based comparison), proximity (number of words or paragraphs between two references), grammatical (based on parts of speech, parse trees, and the like) and semantic (gender, animacy, and the like). See, Veselin Stoyanov, et al., “Conundrums in Noun Phrase Coreference Resolution: Making Sense of the State-of-the-Art,” in Proc. Joint Conf. of the 47th Annual Meeting of the ACL and the 4th International Joint Conf. on Natural Language Processing of the AFNLP, Vol. 2, pp. 656-664 (hereinafter, Stoyanov, et al.). Some of these features are based on the content of the named entity; that is, on the superficial form it takes, on additional information provided by a syntactic parser, and on the local context given by surrounding words. Content-independent features relate only to the close context and thus are only intra-document.
Recently, it has been suggested that temporal information could be used for solving the coreference resolution problem. This method assumes that if two entities have the same temporal profile (in other words, when their bursts of appearances closely coincide) and if their approximate string matching similarity is below a defined threshold, these entities refer to the same real entity. See, for example, Alexander Kotov, et al., “Mining named entities with temporally correlated bursts from multilingual web news streams,” in WSDM, pp. 237-246, ACM, 2011. However, such methods are prone to noise.
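As a rough illustration of the temporal-profile idea, and not a reproduction of the cited method, the following sketch compares two entities by the Pearson correlation of their per-period mention counts and by a normalized string distance. The function names, the use of Pearson correlation, the `difflib`-based distance, and both cutoff values are assumptions introduced for the example; in particular, expressing the string criterion as a distance that must stay under a cutoff is only one reading of the threshold test described above.

```python
from difflib import SequenceMatcher
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length count series."""
    if len(x) != len(y) or not x:
        raise ValueError("series must be non-empty and of equal length")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no burst information
    return cov / sqrt(var_x * var_y)


def string_distance(a: str, b: str) -> float:
    """Normalized distance in [0, 1]; 0 means identical after lower-casing."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def likely_same_entity(name_a, counts_a, name_b, counts_b,
                       min_corr=0.8, max_dist=0.5):
    """Hypothetical decision rule: coinciding bursts and similar surface forms."""
    return (pearson(counts_a, counts_b) >= min_corr
            and string_distance(name_a, name_b) <= max_dist)
```

The sketch also makes the noise problem apparent: any two entities that happen to be mentioned in the same periods and share enough characters would be merged, whether or not they denote the same real entity.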
Another study focused on multi-lingual named entity disambiguation. The main issue faced is the correct translation of person names and the subsequent comparison between these translations. The similarity between two named entities is computed through the following process: transliteration into Roman script, lower-casing, name normalization, vowel removal and, finally, edit distance. See, Ralf Steinberger, et al., “JRC-NAMES: A Freely Available, Highly Multilingual Named Entity Resource,” Proc. Recent Advances in Natural Language Processing, pp. 14-16, 2011. An additional constraint is that, in order to merge two named entities, they have to appear in the same news cluster. This work resulted in the generation of a database of different name variants for the same named entity. The database, while a good resource for some applications, has high precision but low recall and addresses only the most frequent named entities in the news sphere.
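The sequence of steps described above can be sketched, purely for illustration, as a short pipeline. This is not the cited system's implementation: diacritic stripping via Unicode decomposition stands in for true transliteration (it does not handle non-Latin scripts), the name normalization step is reduced to keeping letters and collapsing whitespace, and the function names are hypothetical.

```python
import unicodedata

VOWELS = set("aeiou")


def normalize_name(name: str) -> str:
    """Apply the steps in order: transliterate, lower-case, normalize, drop vowels."""
    # Stand-in for transliteration (assumption): strip combining diacritics.
    decomposed = unicodedata.normalize("NFKD", name)
    latin = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    lowered = latin.lower()
    # Simplified name normalization (assumption): letters only, single spaces.
    letters = "".join(ch if ch.isalpha() else " " for ch in lowered)
    collapsed = " ".join(letters.split())
    return "".join(ch for ch in collapsed if ch not in VOWELS)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if char_a == char_b else 1)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def name_distance(a: str, b: str) -> int:
    """Edit distance between the normalized forms of two names."""
    return edit_distance(normalize_name(a), normalize_name(b))
```

Removing vowels before computing the edit distance makes the comparison tolerant of the vowel variation that transliteration commonly introduces, at the cost of conflating some names that differ only in their vowels.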
The present system and method provide more reliable named entity coreference resolution by adding contextual “socio-temporal” features given by the clustering of the documents into topic-coherent events.