The present invention deals with knowledge-poor sub-sentential paraphrasing. More specifically, the present invention deals with methods for learning meaning-preserving text segment alternations from word-aligned, parallel text (either monolingual or bilingual). The present invention also deals with selectively applying such alternations without introducing ambiguity and/or changing meaning.
The recognition and generation of paraphrases is a key problem for many applications of Natural Language Processing (NLP) systems. Being able to identify that two different pieces of text are equivalent in meaning enables a system to behave much more intelligently. A fundamental goal of work in this area is to produce a program that will be able to re-state a text segment in a manner that preserves its semantic content while manipulating features like vocabulary, word order, reading level, and degree of conciseness or verbosity.
One exemplary application which can benefit from paraphrase identification and generation is a question answering system. For example, consider the question “When did the Governor of California arrive in Sacramento?” It is very likely that a large data corpus, such as a global computer network (or a news reporting system that publishes articles on a global computer network) may already contain text that answers the question. In fact, such a corpus may already contain text that answers the question and is phrased in exactly the same terms as the question. Therefore, a conventional search engine may have no difficulty in finding text that matches the question, and thus returning an adequate result.
The same problem becomes more difficult when searching a smaller data corpus, such as one found on an intranet. In that case, even though the small data corpus may contain text that answers the question, the answer may be phrased in different terms than the question. By way of example, the following sentence answers the question set out above, but is phrased in different terms from the question:
    The California Governor landed in Sacramento on Sep. 20, 2004.
Since this answer is phrased differently than the question, a conventional search engine may encounter difficulty in returning a good result, given only the described textual answer in the corpus that it is searching.
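The mismatch described above can be illustrated with a minimal sketch of exact term matching. The stopword list and the function names `content_terms` and `missing_terms` are hypothetical and chosen only for this illustration; they are not part of any system discussed here.

```python
import re

# Hypothetical stopword list, chosen only for this illustration.
STOPWORDS = {"when", "did", "the", "of", "in", "on"}

def content_terms(text):
    """Lowercase the text and return its non-stopword tokens in order."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def missing_terms(question, candidate):
    """Return question terms that never appear verbatim in the candidate."""
    candidate_terms = set(content_terms(candidate))
    return [t for t in content_terms(question) if t not in candidate_terms]

question = "When did the Governor of California arrive in Sacramento?"
answer = "The California Governor landed in Sacramento on Sep. 20, 2004."

print(missing_terms(question, answer))  # prints ['arrive']
```

A matcher that requires every question term to appear verbatim would reject this answer, because "arrive" is expressed as "landed". Recognizing that alternation is the kind of knowledge a paraphrase system is meant to supply.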
Prior systems for addressing the problem of recognition and generation of paraphrases include large hand-coded efforts that attempt to address the problem in limited contexts. For example, large hand-coded systems attempt to map between a wide variety of different ways of saying the same thing and a form acceptable to a command and control system. Of course, this is extremely difficult because the author of the code likely cannot think of every different way a user might phrase something. Therefore, the focus in the research community has shifted from manual efforts to automatic methods of paraphrase identification and generation.
Recent work on systems aimed at automatically identifying textual paraphrase relations includes D. Lin and P. Pantel, DIRT - DISCOVERY OF INFERENCE RULES FROM TEXT, Proceedings of ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 323-328 (2001). The DIRT article examines the distributional properties of dependency paths linking identical “anchor points” (i.e. identical or similar words) in a parsed corpus of newswire data. None of the special properties of news data are exploited since the parsed corpus is simply viewed as a large source of monolingual data. The basic idea is that high frequency dependency graph paths which link identical or similar words are themselves likely to be similar in meaning. When run over a gigabyte of newspaper data, the system identified patterns such as:
    X is resolved by Y.
    X resolves Y.
    X finds a solution to Y.
    X tries to solve Y.
The DIRT system has been limited to a very restricted sort of “triple” relation, such as “X verb Y”.
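The basic idea behind this kind of approach, that paths linking the same anchor words are likely to be similar in meaning, can be sketched as follows. This is a simplified illustration under stated assumptions: the triples are invented, the paths are given as plain strings rather than extracted from a parser, and the set-overlap score is a stand-in chosen for brevity, not the measure used in the article.

```python
from collections import defaultdict

def slot_fillers(triples):
    """Collect the words seen in the X and Y slots of each path."""
    fillers = defaultdict(lambda: {"X": set(), "Y": set()})
    for x, path, y in triples:
        fillers[path]["X"].add(x)
        fillers[path]["Y"].add(y)
    return fillers

def jaccard(a, b):
    """Set overlap in [0, 1]; an assumed stand-in for a similarity measure."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def path_similarity(fillers, p1, p2):
    """Geometric mean of the overlap in the X slots and in the Y slots."""
    sim_x = jaccard(fillers[p1]["X"], fillers[p2]["X"])
    sim_y = jaccard(fillers[p1]["Y"], fillers[p2]["Y"])
    return (sim_x * sim_y) ** 0.5

# Invented (X, path, Y) observations, for illustration only.
triples = [
    ("committee", "X resolves Y", "dispute"),
    ("court", "X resolves Y", "case"),
    ("committee", "X finds a solution to Y", "dispute"),
    ("court", "X finds a solution to Y", "crisis"),
    ("chef", "X cooks Y", "dinner"),
]

fillers = slot_fillers(triples)
related = path_similarity(fillers, "X resolves Y", "X finds a solution to Y")
unrelated = path_similarity(fillers, "X resolves Y", "X cooks Y")
print(related > unrelated)
```

Because both slots must overlap for the score to be non-zero, a path that shares no anchor words with another receives a similarity of zero regardless of how frequent it is.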
Another article that deals with paraphrase identification is Y. Shinyama, S. Sekine, K. Sudo and R. Grishman, AUTOMATIC PARAPHRASE ACQUISITION FROM NEWS ARTICLES, Proceedings of Human Language Technology Conference, San Diego, Calif. (HLT 2002). In the Shinyama et al. article, the observation is made that articles from different newspapers that describe the same event often exemplify paraphrase relations. The paper describes a technique that relies on the assumption that named entities (such as people, places, dates and addresses) remain constant across different newspaper articles on the same topic or on the same day. Articles are clustered using an existing information retrieval system into, for example, “murder” or “personnel” groupings or clusters. Named entities are annotated using a statistical tagger, and the data is then subjected to morphological and syntactic analysis to produce syntactic dependency trees. Within each cluster, sentences are clustered based on the named entities they contain. For instance, the following sentences are clustered because they share the same four named entities:
    Vice President Osamu Kuroda of Nihon Yamamuri Glass Corp. was promoted to President.
    Nihon Yamamuri Glass Corp. decided the promotion of Vice President Osamu Kuroda to President on Monday.
Given the overlap in named entities, these sentences are assumed to be linked by a paraphrase relationship. Shinyama et al. then attempt to identify patterns that link these sentences using existing machinery from the field of information extraction.
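The grouping step described above can be sketched as a pairing of sentences by shared named entities. This is a minimal sketch, not the implementation from the article: the entity sets below are hand-assigned assumptions standing in for the output of a statistical tagger, the third sentence is invented, and the function name `candidate_pairs` and the `min_shared` threshold are hypothetical.

```python
from itertools import combinations

def candidate_pairs(annotated, min_shared=4):
    """Return index pairs of sentences sharing at least min_shared entities."""
    pairs = []
    for (i, (_, ents_i)), (j, (_, ents_j)) in combinations(enumerate(annotated), 2):
        if len(ents_i & ents_j) >= min_shared:
            pairs.append((i, j))
    return pairs

# (sentence, assumed entity annotations); the annotations are illustrative only.
annotated = [
    ("Vice President Osamu Kuroda of Nihon Yamamuri Glass Corp. was promoted to President.",
     {"Vice President", "Osamu Kuroda", "Nihon Yamamuri Glass Corp.", "President"}),
    ("Nihon Yamamuri Glass Corp. decided the promotion of Vice President Osamu Kuroda "
     "to President on Monday.",
     {"Nihon Yamamuri Glass Corp.", "Vice President", "Osamu Kuroda", "President", "Monday"}),
    ("The company reported its quarterly results on Monday.",  # invented sentence
     {"Monday"}),
]

print(candidate_pairs(annotated))  # prints [(0, 1)]
```

The sketch also makes the dependence on anchors visible: a sentence with no annotated entities can never be paired with anything.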
Shinyama et al. also attempt to learn very simple phrase level patterns, but the technique is limited by its reliance on named entity anchor points. Without these easily identified anchors, Shinyama et al. can learn nothing from a pair of sentences. The patterns that Shinyama et al. learn all center on the relationship between a particular type of entity and some type of event within a particular domain. The results are fairly poor, particularly when the training sentences contain very few named entities.
Another article also deals with paraphrases. In R. Barzilay and L. Lee, LEARNING TO PARAPHRASE: AN UNSUPERVISED APPROACH USING MULTIPLE-SEQUENCE ALIGNMENT, Proceedings of HLT/NAACL (2003), Edmonton, Canada, topic detection software is used to cluster thematically similar newspaper articles from a single source, and from several years' worth of data. More specifically, Barzilay and Lee attempt to identify articles describing terrorist incidents. They then cluster sentences from these articles in order to find sentences that share a basic overall form or that share multiple key words. These clusters are used as the basis for building templatic models of sentences that allow for certain substitutional elements. In short, Barzilay and Lee focus on finding similar descriptions of different events, even events which may have occurred years apart. This focus on grouping sentences by form means that this technique will not find some of the more interesting paraphrases.
Also, Barzilay and Lee require a strong word order similarity in order to class two sentences as similar. For instance, they may not class even active/passive variants of an event description as related. The templatic paraphrase relationships learned by Barzilay and Lee are derived from sets of sentences that share an overall fixed word order. The paraphrases learned by the system amount to regions of flexibility within this larger fixed structure. It should also be noted that Barzilay and Lee appear to be alone in the literature in proposing a generation scheme. The other work discussed in this section is aimed only at recognizing paraphrases.
Another paper, Barzilay and McKeown, EXTRACTING PARAPHRASES FROM A PARALLEL CORPUS, Proceedings of ACL/EACL (2001), relies on multiple translations of a single source document. However, Barzilay and McKeown specifically distinguish their work from machine translation techniques. They state that without a complete match between words in related sentences, one is prevented from using “methods developed in the MT community based on clean parallel corpora.” Thus, Barzilay and McKeown reject the idea that standard machine translation techniques could be applied to the task of learning monolingual paraphrases.
Another prior art system also deals with paraphrases: B. Pang, K. Knight, and D. Marcu, SYNTAX-BASED ALIGNMENT OF MULTIPLE TRANSLATIONS: EXTRACTING PARAPHRASES AND GENERATING NEW SENTENCES, Proceedings of NAACL-HLT (2003). This system relies on multiple translations of a single source to build finite state representations of paraphrase relationships.
Still another prior reference also deals with paraphrase recognition: Ibrahim, Ali, EXTRACTING PARAPHRASES FROM ALIGNED CORPORA, Master's Thesis, MIT (2002). In his thesis, Ibrahim indicates that sentences are “aligned” or subjected to “alignment” and that paraphrases are identified. However, the term “alignment” as used in the thesis means sentence alignment instead of word or phrase alignment and does not refer to the conventional word and phrase alignment performed in machine translation systems. Instead, the alignment discussed in the thesis is based on the following paper, which attempts to align sentences in one language to their corresponding translations in another:
Gale, William A. and Church, Kenneth W., A PROGRAM FOR ALIGNING SENTENCES IN BILINGUAL CORPORA, Proceedings of the Association for Computational Linguistics, pages 177-184 (1991). Ibrahim uses this algorithm to align sentences within multiple English translations of, for example, Jules Verne novels. However, sentence structure can vary dramatically from translation to translation. What one translator represents as a single long sentence, another might map to two shorter ones. This means that the overall number of sentences in the different translations of a single novel does not match, and some sort of automated sentence alignment procedure is needed to identify equivalent sentences. The overall technique Ibrahim uses for extracting paraphrases from these aligned monolingual sentences is derived from the multiple-translation concepts set forth in the Barzilay and McKeown reference, plus a variation on the DIRT framework described by Lin and Pantel.
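The need for sentence alignment when one translator's single sentence corresponds to two sentences in another translation can be illustrated with a small dynamic-programming sketch. It is a simplified illustration under stated assumptions, not the cited algorithm: it scores each grouping by the raw difference in character length, allows only 1-1, 1-2, and 2-1 groupings, and uses invented example sentences. The name `align_by_length` is hypothetical.

```python
def align_by_length(src, tgt):
    """Align two sentence lists with 1-1, 1-2, and 2-1 groupings.

    The cost of a grouping is the absolute difference in total character
    length, an assumed stand-in for a proper statistical length model.
    Returns a list of (src_indices, tgt_indices), or None if no alignment
    using only these groupings exists.
    """
    inf = float("inf")
    n, m = len(src), len(tgt)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue  # this prefix pair cannot be reached
            for di, dj in ((1, 1), (1, 2), (2, 1)):
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                len_src = sum(len(s) for s in src[i:ni])
                len_tgt = sum(len(t) for t in tgt[j:nj])
                total = cost[i][j] + abs(len_src - len_tgt)
                if total < cost[ni][nj]:
                    cost[ni][nj] = total
                    back[ni][nj] = (i, j)
    if cost[n][m] == inf:
        return None
    # Walk the back-pointers from the end to recover the groupings.
    alignment = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        alignment.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    alignment.reverse()
    return alignment

# Invented renderings of the same passage by two hypothetical translators.
translation_a = [
    "The ship left the harbor at dawn, and the crew watched the coast fade.",
    "Nobody spoke.",
]
translation_b = [
    "The vessel departed at daybreak.",
    "The sailors watched the shore vanish.",
    "No one said a word.",
]

print(align_by_length(translation_a, translation_b))
```

In this example the long first sentence of one translation is grouped with the first two sentences of the other, which is the situation described above. Note that the result links whole sentences only; it says nothing about which words or phrases within them correspond.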