Cross-document entity co-reference refers generally to the problem of determining whether name mentions in different documents refer to the same entity or to distinct entities. The problem runs in both directions: the same entity can be referred to by more than one name string (e.g., Mahmoud Abbas and Abu Mazen both refer to the Palestinian leader), and the same name string can be shared by more than one entity (e.g., John Smith is a common name).
Many previous efforts in cross-document entity co-reference have focused only on entity disambiguation, using string retrieval to collect many documents that contain the same name. Others used artificially ambiguated data or analyzed only documents that contained well-structured English with proper grammar and punctuation. Moreover, much of this prior effort has analyzed only one entity type (usually persons) or only one source of data (news articles).
However, names in real-world natural language documents are not always so well-structured. In a multi-genre, multilingual environment, names can be misspelled, mistranslated, or incorrectly transcribed or transliterated, and they can have multiple aliases or multiple equally valid spellings. The diversification of data sources to unstructured text (e.g., blogs, chats, e-mail correspondence, and web pages), speech, and foreign languages has made the cross-document co-reference task more difficult.
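As a minimal illustration of why such variation defeats simple string retrieval, the following Python sketch compares name strings using the standard library's difflib. It is not drawn from any system described here. The function name name_similarity and the variant spelling "Mahmud Abbas" are assumptions introduced only for this example.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] for two name strings.

    Hypothetical helper for illustration: names are case-folded and
    whitespace-normalized before a character-level comparison.
    """
    norm_a = " ".join(a.casefold().split())
    norm_b = " ".join(b.casefold().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


canonical = "Mahmoud Abbas"
spelling_variant = "Mahmud Abbas"  # assumed alternate transliteration
alias = "Abu Mazen"                # alias given in the text above

# Exact string matching treats the spelling variant as a different name.
exact_match = canonical == spelling_variant

# Approximate matching recovers the spelling variant but not the alias.
variant_score = name_similarity(canonical, spelling_variant)
alias_score = name_similarity(canonical, alias)

print(f"exact match on variant: {exact_match}")
print(f"variant similarity:     {variant_score:.2f}")
print(f"alias similarity:       {alias_score:.2f}")
```

In this sketch the spelling variant scores far higher than the alias, which suggests that character-level similarity alone may recover misspellings and transliterations but is unlikely to link aliases that share few characters with the canonical name.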
Available information extraction algorithms do not achieve the same degree of accuracy on documents containing the invalid linguistic constructions that permeate these natural language sources. Therefore, systems and methods that are more capable of analyzing named entities in such natural language settings are desirable.