This invention relates to the acquisition of paraphrases from text by automatic data processing systems.
Information captured in text, e.g., text stored in documents, can encode semantically equivalent ideas through different lexicalizations. Indeed, given the generative power of natural language, different people typically employ different words or phrases to convey the same meaning, depending on factors such as background knowledge, level of expertise, style, verbosity and personal preferences. Two semantically equivalent fragments of text may differ only slightly, as a word or a phrase in one of them is paraphrased in the other, e.g., through a synonym. Yet even small lexical variations pose challenges to any automatic process attempting to decide whether two text fragments have the same meaning, or are relevant to each other, since the fragments are no longer lexically identical. Many natural-language-intensive computer program applications include modules that make such decisions. In document summarization, the generated summaries have a higher quality if redundant information has been discarded by detecting text fragments with the same meaning. In information extraction, extraction templates will not be filled consistently whenever there is a mismatch in the trigger word or the applicable extraction pattern. Similarly, a question answering system could incorrectly discard a relevant document passage based on the absence of a question phrase deemed very important, even if the passage actually contains a legitimate paraphrase.
Identifying different words or phrases with the same meaning can be useful in various contexts, such as identifying alternative query terms for a specific user query. Deciding whether a text fragment (e.g., a document) is relevant to another text fragment (i.e., the query) is generally crucial to the overall output, rather than merely useful within some internal system module. Indeed, relevant documents or passages may be missed due to an apparent mismatch between their terms and the paraphrases occurring in the users' queries. Previously proposed solutions to the mismatch problem vary with respect to the source of the data used for enriching the query with alternative terms.
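By way of illustration only, the term mismatch described above, and the effect of enriching a query with alternative terms, can be sketched as follows. The paraphrase table, the function names and the sample query and passage are all hypothetical and are not taken from any particular system. The sketch assumes simple whitespace tokenization and exact term matching.

```python
# Hypothetical paraphrase table: each query term maps to alternative terms.
# The entries are illustrative assumptions, not data from the source.
PARAPHRASES = {
    "purchase": {"buy", "acquire"},
    "car": {"automobile"},
}


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def matches(query, passage, expand=False):
    """Return True if every query term occurs in the passage.

    When expand is True, a query term also counts as present if any of
    its alternative terms from PARAPHRASES occurs in the passage.
    """
    passage_terms = set(tokenize(passage))
    for term in tokenize(query):
        candidates = {term}
        if expand:
            candidates |= PARAPHRASES.get(term, set())
        if not (candidates & passage_terms):
            return False
    return True


query = "purchase car"
passage = "where to buy a used automobile"

exact_match = matches(query, passage)                 # lexical match only
expanded_match = matches(query, passage, expand=True)  # with alternative terms
```

With exact matching the passage is missed, because neither "purchase" nor "car" occurs in it. Once each query term is enriched with its alternatives, the same passage is found.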