In a question-answering system, when a question such as “What causes lung cancer?” is given, a sentence such as “Pollution causes lung cancer” is typically returned as an answer, since the question and the answer share the expression “cause(s) (lung cancer).” An appropriate answer, however, is not limited to sentences that share an expression with the question. By way of example, the expression “Smoking leads to lung cancer” may also be an appropriate answer. In order to obtain such an answer, the system must have the knowledge that “A cause(s) B” is interchangeable with “A lead(s) to B.” Here, A and B are variables that can be replaced by any words.
In the present specification, types that are commonly found in a plurality of expressions will be referred to as language patterns or simply patterns. Specifically, an expression consisting of a combination of a verb phrase and n terms (where n is an integer not smaller than 0) will be referred to as an n-term language pattern. For example, “A cause(s) B” is a 2-term language pattern consisting of a combination of the verb phrase “cause(s)” and two terms, namely the variable terms A and B.
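By way of illustration only, an n-term language pattern may be represented as a small data structure that holds a verb-phrase template and its variable terms. The following is a minimal sketch; the class name `LanguagePattern`, its fields, and the `{A}`/`{B}` placeholder syntax are assumptions made for this example and do not appear in the specification. Verb inflection (“cause” versus “causes”) is ignored for simplicity.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class LanguagePattern:
    """A verb phrase combined with n variable terms (n >= 0). Hypothetical type."""

    template: str               # e.g. "{A} causes {B}"
    variables: Tuple[str, ...]  # e.g. ("A", "B")

    @property
    def arity(self) -> int:
        # The "n" of an n-term language pattern.
        return len(self.variables)

    def instantiate(self, bindings: Dict[str, str]) -> str:
        # Replace each variable term with a concrete word or phrase.
        return self.template.format(**bindings)


# "A cause(s) B" as a 2-term language pattern.
causes = LanguagePattern("{A} causes {B}", ("A", "B"))
sentence = causes.instantiate({"A": "Pollution", "B": "lung cancer"})
```

Here `causes.arity` is 2, and `sentence` is the concrete expression obtained by binding A and B.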
In the present specification, a pair of language patterns (a pattern pair) having an entailment relation is referred to as an entailment pattern pair (or simply an “entailment pair”). For a question-answering system, it is desirable to collect a large number of entailment pairs with high precision.
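As a rough illustration of why entailment pairs are useful in question answering, the sketch below matches candidate sentences against the question pattern and against every pattern known to entail it. The table `ENTAILMENT_PAIRS`, the helper names, and the whitespace-token matching are all assumptions made for this example; a practical system would also have to handle inflection and parsing, which are omitted here.

```python
import re
from typing import List, Set, Tuple

# Hypothetical entailment pairs: (entailing pattern, entailed pattern).
ENTAILMENT_PAIRS: Set[Tuple[str, str]] = {
    ("A leads to B", "A causes B"),
}


def pattern_to_regex(pattern: str, b_value: str) -> "re.Pattern[str]":
    """Turn a 2-term pattern into a regex that captures A for a known B."""
    parts = []
    for token in pattern.split():
        if token == "A":
            parts.append(r"(?P<A>.+?)")        # unknown term to extract
        elif token == "B":
            parts.append(re.escape(b_value))   # term given by the question
        else:
            parts.append(re.escape(token))     # verb-phrase word
    return re.compile(r"^" + r"\s+".join(parts) + r"\.?$", re.IGNORECASE)


def find_answers(question_pattern: str, b_value: str,
                 sentences: List[str]) -> List[str]:
    # The question pattern itself, plus every pattern that entails it.
    patterns = [question_pattern] + sorted(
        p for p, q in ENTAILMENT_PAIRS if q == question_pattern
    )
    answers = []
    for sentence in sentences:
        for pattern in patterns:
            match = pattern_to_regex(pattern, b_value).match(sentence)
            if match:
                answers.append(match.group("A"))
                break
    return answers


sentences = [
    "Pollution causes lung cancer.",
    "Smoking leads to lung cancer.",
    "Lung cancer is common.",
]
answers = find_answers("A causes B", "lung cancer", sentences)
```

With the single entailment pair above, `answers` contains both “Pollution” and “Smoking”; without it, only the sentence sharing the verb phrase “causes” with the question would be found.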
Non-Patent Literature 1 specified below describes a conventional technique of collecting entailment pairs. According to Non-Patent Literature 1, pattern pairs having entailment relations are collected in the following manner. First, pattern pairs known to have entailment relations are manually collected to build training data. Using the training data, a determiner is trained that, given two language patterns, determines whether one entails the other; scores such as N-gram scores and distributional similarity are used as features. Once training of the determiner is complete, a huge number of entailment pair candidates are generated at random from a corpus including a large number of sentences. Each candidate is subjected to determination by the determiner. The pattern pairs determined to have entailment relations are then collected, whereby new entailment pairs not existing in the training data can be obtained.
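The flow described above (labeled training data, a trained determiner, randomly generated candidates, and collection of the pairs judged positive) can be sketched as follows. This is a toy sketch under stated assumptions, not the method of Non-Patent Literature 1: the corpus statistics in `ARGS` and the labels in `TRAIN` are invented for illustration, the N-gram feature is reduced to word-unigram overlap of the verb phrases, the distributional similarity is a symmetric Jaccard score over shared argument pairs, and a nearest-centroid classifier stands in for whatever learner the literature actually uses.

```python
import random
from itertools import permutations
from typing import Callable, Dict, Set, Tuple

Pair = Tuple[str, str]  # (entailing pattern, entailed pattern)

# Hypothetical corpus statistics: (A, B) argument pairs seen with each pattern.
ARGS: Dict[str, Set[Tuple[str, str]]] = {
    "A causes B": {("smoking", "cancer"), ("rain", "flood"),
                   ("virus", "flu"), ("stress", "insomnia")},
    "A leads to B": {("smoking", "cancer"), ("rain", "flood"),
                     ("stress", "insomnia")},
    "A triggers B": {("virus", "flu"), ("rain", "flood"),
                     ("stress", "insomnia")},
    "A gives rise to B": {("smoking", "cancer"), ("virus", "flu"),
                          ("rain", "flood")},
    "A prevents B": {("vaccine", "flu"), ("dam", "flood")},
    "A cures B": {("drug", "flu"), ("rest", "insomnia")},
}

# Hypothetical manually labeled training data.
TRAIN: Dict[Pair, bool] = {
    ("A leads to B", "A causes B"): True,
    ("A triggers B", "A causes B"): True,
    ("A gives rise to B", "A leads to B"): True,
    ("A prevents B", "A causes B"): False,
    ("A cures B", "A prevents B"): False,
}


def jaccard(x: set, y: set) -> float:
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def features(p: str, q: str) -> Tuple[float, float]:
    # Feature 1: word-unigram overlap of the verb phrases (variables removed).
    verb_p = set(p.split()) - {"A", "B"}
    verb_q = set(q.split()) - {"A", "B"}
    # Feature 2: distributional similarity over shared argument pairs.
    return jaccard(verb_p, verb_q), jaccard(ARGS[p], ARGS[q])


def centroid(rows):
    return tuple(sum(col) / len(rows) for col in zip(*rows))


def train(data: Dict[Pair, bool]) -> Callable[[str, str], bool]:
    pos = centroid([features(*pair) for pair, label in data.items() if label])
    neg = centroid([features(*pair) for pair, label in data.items() if not label])

    def determine(p: str, q: str) -> bool:
        f = features(p, q)
        d_pos = sum((a - b) ** 2 for a, b in zip(f, pos))
        d_neg = sum((a - b) ** 2 for a, b in zip(f, neg))
        return d_pos < d_neg  # closer to the positive centroid

    return determine


determine = train(TRAIN)

# Generate candidates at random, then keep those judged to be entailment pairs.
rng = random.Random(0)
candidates = [pair for pair in permutations(ARGS, 2) if pair not in TRAIN]
sampled = rng.sample(candidates, k=10)
new_pairs = [pair for pair in sampled if determine(*pair)]
```

In this toy setting, `determine("A gives rise to B", "A causes B")` is judged positive although that pair is absent from `TRAIN`, which mirrors how new entailment pairs are collected. Because the features here are symmetric, the sketch cannot distinguish the direction of entailment.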