1. Field of the Invention
The present invention relates to a natural language analysis technique, and in particular to an information processing apparatus, a natural language analysis method, a program and a recording medium for calculating a score indicating how well a sentence matches a query pattern having a dependency structure.
2. Description of Related Art
Recently, with the development of information processing techniques, such as those used by computers and the Internet, an enormous amount of unstructured text information is being generated, and utilization of such text information is growing in importance. For a sentence in a natural language such as Japanese or English, it is possible to estimate the semantic dependency structure among the words by dividing the sentence into words by morpheme analysis and then performing dependency syntax analysis. There has also been increasing demand to extract a particular reputation expression from reputation information and the like about a product, or to extract a characteristic expression about a technique. In light of this, there is a demand for a technique capable of performing highly accurate information searches and information extractions in consideration of not only whether a particular word exists, but also the dependency structure, which is a higher-level semantic expression.
In such extraction, however, extraction omissions may occur because of errors in the dependency syntax analysis itself. Because of the ambiguity inherent in natural languages, a single sentence written in a natural language admits multiple interpretable syntax trees. Dependency syntax analysis therefore produces analysis errors more frequently than morpheme analysis and the like. Although the analysis accuracy for each Bunsetsu phrase is about 90%, the probability that the whole dependency structure is analyzed correctly is lower. In a simple trial calculation that treats each dependency relation as being analyzed correctly with a probability of about 90%, independently of the others, the analysis accuracy for a pattern including two dependency relations is about 81% (0.9 × 0.9), and it decreases to about 73% (0.9 × 0.9 × 0.9) if three dependency relations are included.
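The trial calculation above can be reproduced with a short sketch. It assumes, as a simplification not stated in the cited figures, that every dependency relation in the pattern is analyzed correctly with the same probability and independently of the others; the function name `pattern_accuracy` is hypothetical.

```python
def pattern_accuracy(per_relation_accuracy: float, num_relations: int) -> float:
    """Probability that all relations in a pattern are analyzed correctly.

    Assumption (illustrative only): each relation is correct with the same
    probability, independently of the other relations in the pattern.
    """
    return per_relation_accuracy ** num_relations


# Accuracy for patterns with one, two and three dependency relations.
for k in (1, 2, 3):
    print(k, round(pattern_accuracy(0.9, k), 3))
```

Under this assumption the accuracy falls geometrically with the number of relations in the pattern, which is why patterns with more dependency relations are more likely to be missed.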
As prior-art techniques for performing information search and extraction in consideration of a dependency structure, approaches called the 1-best method, the N-best method and intrasentential co-occurrence are known. The 1-best method performs pattern matching only against the single syntax analysis result that has the highest score for a sentence. The N-best method acquires the N highest-scoring syntax analysis results for a sentence, performs pattern matching against each of the N results, and determines that the pattern is matched if any of those results matches the pattern (V. M. Jimenez, A. Marzal, “Computation of the n best parse trees for weighted and stochastic context-free grammars”, Advances in Pattern Recognition, Lecture Notes in Computer Science, Volume 1876, pp. 183-192, 2000). Intrasentential co-occurrence is an approach in which a sentence is matched depending only on whether multiple words co-occur in it, without regard to the dependency relations among them. Non Patent Literature 2 (Yuya Unno, Yuta Tsuboi, “Intersegment distance based on marginal probability of dependency”, Collection of Papers of The Sixteenth Annual Meeting of The Association for Natural Language Processing, pp. 23-26, March, 2010) discloses an approach for calculating an expected value of a distance on a dependency tree for the purpose of robustly performing information extraction from the dependency tree. In addition, for example, Japanese Patent No. 4049141, Japanese Patent No. 4341077 and Japanese Patent Laid-Open No. 2001-134575 are known as prior-art techniques related to tree-structure pattern extraction.
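The difference among the three prior-art approaches can be illustrated with a minimal sketch. It is not taken from any of the cited documents: the representation of a syntax analysis result as a score plus a set of (dependent, head) word pairs, the function names, the example sentence and the scores are all assumptions made for illustration.

```python
from typing import FrozenSet, List, Tuple

Edge = Tuple[str, str]                 # (dependent word, head word); assumed representation
Parse = Tuple[float, FrozenSet[Edge]]  # (parser score, dependency edges); assumed representation


def one_best_match(parses: List[Parse], pattern: FrozenSet[Edge]) -> bool:
    """1-best: match only against the highest-scoring analysis result."""
    _, best_edges = max(parses, key=lambda p: p[0])
    return pattern <= best_edges


def n_best_match(parses: List[Parse], pattern: FrozenSet[Edge], n: int) -> bool:
    """N-best: matched if any of the N highest-scoring results contains the pattern."""
    top_n = sorted(parses, key=lambda p: p[0], reverse=True)[:n]
    return any(pattern <= edges for _, edges in top_n)


def cooccurrence_match(words: List[str], pattern: FrozenSet[Edge]) -> bool:
    """Intrasentential co-occurrence: only checks that the pattern's words appear."""
    needed = {word for edge in pattern for word in edge}
    return needed <= set(words)


# Hypothetical example: "saw man with telescope" with two competing attachments.
words = ["saw", "man", "with", "telescope"]
parses: List[Parse] = [
    (0.6, frozenset({("man", "saw"), ("with", "telescope"), ("telescope", "man")})),
    (0.4, frozenset({("man", "saw"), ("with", "telescope"), ("telescope", "saw")})),
]
pattern: FrozenSet[Edge] = frozenset({("telescope", "saw")})

print(one_best_match(parses, pattern))     # pattern is absent from the top result
print(n_best_match(parses, pattern, 2))    # pattern is found in the second result
print(cooccurrence_match(words, pattern))  # both words occur in the sentence
```

In this hypothetical case the 1-best method misses the pattern because the top-scoring result attaches "telescope" to "man", the N-best method with N = 2 finds it in the second result, and co-occurrence matches without consulting the dependency structure at all.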