Question-answering (QA) research on questions about facts, so-called factoid questions, has recently achieved great success. It is still fresh in our memory that a system of this kind defeated human contestants in a quiz program in the United States. On factoid questions, its accuracy is reported to be about 85%. Researchers have begun to recognize the necessity of studying question-answering systems that attain similarly high accuracy on questions other than factoid questions. Studies of question-answering systems for non-factoid questions, such as “why” questions and “how to” questions, however, have not shown substantial progress.
Non-Patent Literature 1 listed below discloses an example of such a system. In this system, a question and each of the sentences in a corpus are subjected to morphological analysis, and using the result of analysis, a score is calculated from the document frequency of each term obtained from the question, the frequency of occurrence of the term in each sentence, the total number of documents and the length of the documents. Then, a prescribed number of documents with higher scores are selected from the corpus. Paragraphs, or sequences of one to three consecutive paragraphs, contained in the selected documents serve as answer candidates. Based on a score calculated mainly from terms in the question and terms contained in the answer candidates, an answer to the question is selected.
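By way of illustration only, scoring of this general kind can be sketched as follows. The exact formula of Non-Patent Literature 1 is not reproduced here; the sketch substitutes the well-known Okapi BM25 weighting, which likewise combines document frequency, term frequency, the total number of documents and document length. The function names `bm25_scores` and `top_documents`, the parameters `k1` and `b`, and the assumption that every document has already been split into terms by morphological analysis are all introduced for this example.

```python
import math
from collections import Counter


def bm25_scores(query_terms, documents, k1=1.5, b=0.75):
    """Score each tokenized document against the query terms (Okapi BM25).

    Assumption: `documents` is a non-empty list of term lists produced by
    some morphological analyzer; BM25 is a stand-in, not the cited formula.
    """
    n_docs = len(documents)
    avg_len = sum(len(doc) for doc in documents) / n_docs
    unique_terms = set(query_terms)
    # Document frequency: number of documents that contain each query term.
    df = {t: sum(1 for doc in documents if t in doc) for t in unique_terms}

    scores = []
    for doc in documents:
        tf = Counter(doc)  # term frequency within this document
        score = 0.0
        for term in unique_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def top_documents(query_terms, documents, n):
    """Return the indices of the n highest-scoring documents."""
    scores = bm25_scores(query_terms, documents)
    order = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
    return order[:n]


# Hypothetical toy corpus, already tokenized.
TOY_DOCS = [
    ["why", "sky", "blue", "sky"],
    ["how", "bake", "bread"],
    ["sky", "is", "high", "today", "and", "clear"],
]
TOY_QUERY = ["sky", "blue"]
```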
This system, however, is found to be unsatisfactory, as will be described later. As an improvement over the system, a system described in Non-Patent Literature 2 has therefore been proposed. According to this system, several answer candidates are selected by the technique described in Non-Patent Literature 1, and the answer candidates are re-ranked using prescribed scores.
A typical implementation of the system is summarized below, based on the description of Non-Patent Literature 2. Hereinafter, a question not related to a fact will be referred to as a “non-factoid question.”
Referring to FIG. 1, a question-answering system 30 stores, in a corpus storage 48, a corpus including a huge number of sentences (limited to Japanese here) that can be searched over the Internet. When the system receives a non-factoid question transmitted from a service utilizing terminal 44 capable of text communication, such as a portable telephone, an answering unit 40 selects several sentences considered highly probable to be answers from among the huge number of sentences stored in corpus storage 48, and the selected sentences are returned as an answer list 50 to service utilizing terminal 44. Answering unit 40 uses support vector machines (SVMs) 46 to rank the answer sentences, and a training unit 42 trains SVMs 46 in advance using supervised machine learning.
Training unit 42 includes: QA sentences storage 60 storing, in advance, Japanese QA sentences including non-factoid questions, correct or incorrect answers thereto, and flags indicating whether the answers are correct or not; a training data generating unit 62, analyzing QA sentences stored in QA sentences storage 60 and generating, as features to be used for training SVMs 46, training data including pre-selected various combinations of statistical information related to syntax and flags indicating whether an answer to each QA is a correct answer to the question; a training data storage 64 storing training data generated by training data generating unit 62; and a training unit 66 realizing supervised machine learning of SVMs 46 using the training data stored in training data storage 64. As a result of this training, SVMs 46 comes to output, when it receives features of the same type of combination as generated by training data generating unit 62, a measure indicating whether the combination of the question sentence and the answer candidate that caused the combination of features is a correct combination or not, namely, whether the answer candidate is the correct answer to the question.
It is assumed that each sentence stored in corpus storage 48 is subjected to the same analysis as conducted beforehand on each answer by training data generating unit 62, and that information necessary to generate the features to be applied to SVMs 46 is assigned to each sentence.
Answering unit 40 includes: a question analyzing unit 86, responsive to reception of a question sentence from service utilizing terminal 44, for performing predetermined grammatical analysis of the question sentence and outputting pieces of information (part of speech, conjugation, dependency structure and the like) necessary for generating features, for each word or term included in the question sentence; a candidate retrieving unit 82, responsive to reception of a question sentence from service utilizing terminal 44, for searching corpus storage 48 and extracting a prescribed number (for example, 300) of answer candidates to the question; and an answer candidate storage 84 for storing the prescribed number of candidates output from candidate retrieving unit 82 together with grammatical information thereof.
Though candidates are searched and extracted from corpus storage 48 and stored in answer candidate storage 84 in this example, it is not essential to narrow down the candidates. By way of example, all sentences stored in corpus storage 48 may be regarded as the answer candidates. In that case, what is required of candidate retrieving unit 82 is simply a function of reading all sentences stored in corpus storage 48, and what is required of answer candidate storage 84 is simply a function of temporarily storing the sentences read by candidate retrieving unit 82. Further, though question-answering system 30 locally holds corpus storage 48 in this example, this is not limiting. Corpus storage 48 may be remotely located, and the corpus may be stored not only in one storage device but also distributed over a plurality of storage devices.
Answering unit 40 further includes: a feature vector generating unit 88 for generating feature vectors based on the combination of information output from question analyzing unit 86 and each of the answer candidates stored in answer candidate storage 84, and for applying the feature vectors to SVMs 46; and an answer ranker unit 90 applying, to SVMs 46, the feature vectors given from feature vector generating unit 88 for the combinations of the question sentence and each of the answer candidates and, based on the results eventually output from SVMs 46, ranking each of the answer candidates stored in answer candidate storage 84, and outputting a prescribed number of the highest-ranked answer candidates as an answer list 50. Typically, a basic function of SVMs 46 is to mathematically find a hyperplane for classifying objects into two classes and to output the results as positive/negative polarity information. It is noted, however, that the SVMs can also output a distance from the hyperplane to the point defined by an input. The distance is considered to represent appropriateness of an answer and, therefore, answer ranker unit 90 uses a combination of the distance and the polarity information output from SVMs 46 as a score of the answer candidate.
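As a minimal sketch of how polarity and distance can be combined into one score, consider a linear SVM whose hyperplane is given by a weight vector and a bias. The signed distance from the hyperplane carries the polarity in its sign and the distance in its magnitude. The function names and the restriction to a linear kernel are assumptions made for this example; the source does not state which kernel or which exact combination is used.

```python
import math


def decision_value(weights, bias, x):
    """Raw linear SVM output w.x + b for feature vector x."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def polarity(weights, bias, x):
    """+1 if x lies on the positive side of the hyperplane, otherwise -1."""
    return 1 if decision_value(weights, bias, x) >= 0 else -1


def signed_distance(weights, bias, x):
    """Geometric distance from x to the hyperplane, signed by polarity.

    Assumption: `weights` is not the zero vector.
    """
    norm = math.sqrt(sum(w * w for w in weights))
    return decision_value(weights, bias, x) / norm


# Hypothetical hyperplane 3*x1 + 4*x2 - 5 = 0.
W, B = [3.0, 4.0], -5.0
```

Used as a ranking score, a larger signed distance places a candidate further inside the "correct answer" side, so candidates can be ordered directly by this value.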
In this question-answering system 30, a large number of combinations of a question and sentences serving as positive examples, which are appropriate as answers to the question, and a large number of combinations of the question and sentences serving as negative examples, which are incorrect as answers to the question, are stored in advance in QA sentences storage 60. A flag indicating whether the answer is correct or not is manually added to each combination. Training data generating unit 62 generates training data for training SVMs 46 from these combinations, and stores the data in training data storage 64. Using the training data stored in training data storage 64, training unit 66 trains SVMs 46. As a result of this process, SVMs 46 acquires the ability to output, when it receives a combination of features of the same type as generated by training data generating unit 62, a value indicating whether the combination of source sentences (question and answer candidate) is correct or not, or a value indicating the degree of correctness of the answer candidate to the question.
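The training step described above can be sketched, under stated assumptions, as supervised learning of a linear SVM from pairs of a feature vector and a correct/incorrect flag. The sketch uses plain subgradient descent on the regularized hinge loss, which is one common way to train a linear SVM and is not claimed to be the procedure of Non-Patent Literature 2. The two features (term overlap ratio and presence of a causal cue), the toy examples and all function names are hypothetical stand-ins for the syntax-related statistical features mentioned in the text.

```python
def svm_score(w, b, x):
    """Linear SVM output for feature vector x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_linear_svm(examples, epochs=200, lr=0.1, reg=0.01):
    """Train on (feature_vector, is_correct_flag) pairs.

    Subgradient descent on the L2-regularized hinge loss; the
    hyperparameter values are illustrative assumptions.
    """
    dim = len(examples[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, flag in examples:
            y = 1.0 if flag else -1.0  # manual flag -> class label
            margin = y * svm_score(w, b, x)
            # L2 regularization shrinks the weights on every step.
            w = [wi * (1.0 - lr * reg) for wi in w]
            if margin < 1.0:
                # Hinge loss is active: push this example toward its side.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


# Hypothetical training data: [term overlap ratio, causal cue present].
TOY_EXAMPLES = [
    ([0.9, 1.0], True),
    ([0.8, 1.0], True),
    ([0.2, 0.0], False),
    ([0.1, 0.0], False),
]
```

After training, `svm_score` plays the role of the value indicating the degree of correctness: flagged-correct examples should receive higher scores than flagged-incorrect ones.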
On the other hand, a corpus including a large number of sentences is stored in corpus storage 48. Each sentence has been subjected to the same type of analysis as conducted by training data generating unit 62, and each sentence has information for ranking the answer candidates, similar to part of the training data, assigned thereto. Upon receiving a question sentence from service utilizing terminal 44, candidate retrieving unit 82 performs a known candidate retrieving process and extracts a prescribed number of answer candidates to the question sentence from corpus storage 48. The answer candidates extracted by candidate retrieving unit 82 are stored, together with the information for ranking the answer candidates, in answer candidate storage 84.
Meanwhile, question analyzing unit 86 performs a prescribed analysis on the question sentence, thereby generates the information necessary to generate features, and applies it to feature vector generating unit 88. Upon receiving the information from question analyzing unit 86, feature vector generating unit 88 combines it with the information for ranking answer candidates assigned to each answer candidate stored in answer candidate storage 84, thereby generates feature vectors having the same configuration as the training data generated by training data generating unit 62 (without the flag indicating whether the answer candidate is correct or not), and applies the feature vectors to answer ranker unit 90.
Answer ranker unit 90 applies, to SVMs 46, the feature vectors supplied from feature vector generating unit 88, each obtained from the combination of an answer candidate and the question sentence. For the feature vector of each combination, SVMs 46 outputs a score indicating how appropriate the answer candidate in the combination is for the question in the combination. Answer ranker unit 90 sorts the combinations of the question and each answer candidate in descending order of the score, and returns a prescribed number of the highest-ranked answer candidates to service utilizing terminal 44, in the form of an answer list 50, as the answer to the question received from service utilizing terminal 44.
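The flow from feature vector generation to the answer list can be sketched as follows. This is an illustrative outline only: `rank_answers`, `toy_features` and `toy_score` are hypothetical names, the two features are toy stand-ins for the grammatical features of the actual system, and the fixed linear scoring function takes the place of trained SVMs 46.

```python
def rank_answers(question_info, candidates, make_features, score_fn, list_size):
    """Score every (question, candidate) pair and return the top candidates."""
    scored = []
    for candidate in candidates:
        features = make_features(question_info, candidate)
        scored.append((score_fn(features), candidate))
    # Descending order of score; the sort is stable, so ties keep
    # the order in which the candidates were retrieved.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [candidate for _, candidate in scored[:list_size]]


def toy_features(question_info, candidate):
    """Hypothetical features: shared-term count and a causal-cue indicator."""
    words = set(candidate.split())
    overlap = len(words & question_info["terms"])
    has_cue = 1.0 if "because" in words else 0.0
    return [float(overlap), has_cue]


def toy_score(features):
    """Stand-in for the SVM output: a fixed linear function."""
    return features[0] + 0.5 * features[1]


TOY_QUESTION = {"terms": {"why", "sky", "blue"}}
TOY_CANDIDATES = [
    "the sky is blue because of scattering",
    "bread rises because of yeast",
    "blue sky",
]
ANSWER_LIST = rank_answers(TOY_QUESTION, TOY_CANDIDATES,
                           toy_features, toy_score, list_size=2)
```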