Language analysis by parsing is typically performed by first dividing an input character string into sentences and then subjecting each of these sentences to an analysis process. However, when analyzing a very long sentence, such as those often seen in specifications for patent applications, a simple analysis process on a sentence-by-sentence basis may encounter some problems.
A language analysis apparatus, such as one for parsing, typically performs an analysis process by dividing an input character string into sentences and then examining the relationship between each pair of words contained in each sentence. This means that the number of word pairs to be considered grows roughly with the square of the number of words in the input sentence.
If a very long sentence is to be analyzed, an enormous number of word pairs must be examined. This leads to various problems, including long analysis time and a large amount of memory required for analysis.
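The growth in the number of word pairs can be illustrated with a minimal sketch. It assumes that every unordered pair of words in a sentence is examined exactly once, so that a sentence of n words yields n(n-1)/2 pairs; the function name `word_pair_count` is hypothetical and used only for illustration.

```python
def word_pair_count(n_words: int) -> int:
    """Number of unordered word pairs in a sentence of n_words words.

    Assumption: each pair of distinct words is examined exactly once.
    """
    return n_words * (n_words - 1) // 2


# Compare a short sentence with progressively longer ones.
for n in (10, 50, 200):
    print(n, word_pair_count(n))
# 10 45
# 50 1225
# 200 19900
```

Under this assumption, a sentence twenty times longer requires several hundred times as many pair examinations.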
Furthermore, the number of possible interpretations increases with an increase in the number of word pairs to be considered. This in turn increases the potential for analysis errors. To avoid these problems, various methods have been proposed for dividing an input sentence, if it is too long, before performing an analysis process.
For example, Patent Literature 1 discloses a method in which, if a machine translation process takes longer than a predetermined time, a previously given division rule is applied to divide the input sentence into smaller units, and the machine translation process is then performed on each unit.
The method proposed in Patent Literature 2 stores division rules in association with adaptive word counts and applies division rules sequentially in the descending order of adaptive word counts so as to enable the input sentence to be divided into more appropriate units.
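The sequential application of division rules described above may be sketched as follows. This is only an illustrative sketch and not the implementation of Patent Literature 2. It assumes that each rule is a pair of an adaptive word count and a marker word, that a rule is applied to a unit only when the unit contains at least that many words, and that a unit is divided immediately before each occurrence of the marker word. The names `divide` and `rules`, and the marker words in the usage example, are hypothetical.

```python
from typing import List, Sequence, Tuple

Rule = Tuple[int, str]  # (adaptive word count, marker word); hypothetical format


def split_before(unit: Sequence[str], marker: str) -> List[List[str]]:
    """Divide a unit immediately before each occurrence of the marker word."""
    parts: List[List[str]] = []
    current: List[str] = []
    for word in unit:
        if word == marker and current:
            parts.append(current)
            current = []
        current.append(word)
    if current:
        parts.append(current)
    return parts


def divide(words: Sequence[str], rules: Sequence[Rule]) -> List[List[str]]:
    """Apply rules in descending order of their adaptive word counts."""
    units: List[List[str]] = [list(words)]
    for count, marker in sorted(rules, key=lambda r: r[0], reverse=True):
        next_units: List[List[str]] = []
        for unit in units:
            if len(unit) >= count:
                # Assumed condition: the unit is long enough for this rule.
                next_units.extend(split_before(unit, marker))
            else:
                next_units.append(unit)
        units = next_units
    return units


rules = [(3, "and"), (6, "wherein")]
sentence = "a b wherein c d and e f".split()
print(divide(sentence, rules))
# [['a', 'b'], ['wherein', 'c', 'd'], ['and', 'e', 'f']]
```

In this example the rule with the larger adaptive word count produces the broad division first, and the rule with the smaller count then divides only the unit that is still long enough.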
Patent Literature 1: Japanese Patent Laying-Open No. 61-255468
Patent Literature 2: Patent No. 003173514
The problems with the aforementioned methods of dividing input sentences for language analysis by parsing will be described below.
The first problem is that, when a maximum length of input (hereinafter, “maximum input length”) acceptable in an analysis process is given, a long sentence cannot be divided into processing units of an appropriate length in accordance with that maximum input length.
Division rules are roughly categorized into two types. One type of division rule focuses on linguistic expressions that provide relatively broad breaks, while the other focuses on those that provide relatively fine breaks. Generally speaking, a division rule of the former type allows analysis to be made correctly even if the sentence is divided at the division points obtained by applying the rule and an analysis process is performed on each division unit as is, without any adjustment. However, a rule of this type focuses on specific linguistic expressions that are relatively rare. This is problematic because division points cannot necessarily be obtained from every input sentence and, even when they are obtained, the resulting division units may not be sufficiently short.
The latter type of division rule, on the other hand, obtains division points by focusing on linguistic expressions that are used relatively frequently. Therefore, a division rule of this type allows division points to be obtained from a relatively large number of sentences. In addition, the resulting division units are likely to be sufficiently short. However, this raises a problem in that analysis accuracy often decreases, because individual division units may become too short to be analyzed correctly.
The division method disclosed in Patent Literature 2 attempts to resolve the problems described above by storing division rules in association with adaptive word counts and applying the division rules sequentially in the descending order of the adaptive word counts. This method, however, also suffers from decreased analysis accuracy. One reason for this is the difficulty of setting an appropriate adaptive word count for each division rule. Another is that, when a stage is reached where a division rule with a smaller adaptive word count needs to be applied, the resulting division units become too short to ensure correct analysis.