Prior art sentence dividing techniques are generally classified into the following four types:
(1) Division by kind of character
A change of character type, such as between Kanji, Katakana and Hiragana, is used to decide where to divide. With this information alone, 84% of the correct divisions can be obtained (Sakamoto, "Recognition of a Clause", Collection of Reports of the Japanese Information Processing Symposium, July 17-20, 1978, pp. 105-111, The Information Processing Society). However, this technique is usually used as preprocessing for, or as part of, the techniques described below. That is, after a text is roughly segmented with this approach, the segmented pieces are analyzed in more detail. This preprocessing shortens the unit of the subsequent analysis and thereby shortens the processing time. However, segmentation at a wrong place seriously affects the subsequent processing, so such errors must be prevented or corrected later. This approach does not provide for detailed analysis and division as in the present invention.
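As an illustration only, the character-type approach can be sketched as follows. The Unicode ranges, the function names `char_type` and `split_by_char_type`, and the sample sentence are assumptions introduced for this sketch and are not taken from the cited report.

```python
def char_type(ch: str) -> str:
    """Classify one character by Unicode block (a deliberate simplification)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    return "other"


def split_by_char_type(text: str) -> list[str]:
    """Start a new piece at every change of character type."""
    pieces: list[str] = []
    for ch in text:
        if pieces and char_type(pieces[-1][-1]) == char_type(ch):
            pieces[-1] += ch  # same type as the previous character: extend
        else:
            pieces.append(ch)  # type changed (or first character): new piece
    return pieces


print(split_by_char_type("日本語のテキストを分割する"))
# → ['日本語', 'の', 'テキスト', 'を', '分割', 'する']
```

The sketch also shows why the result is only a rough segmentation: the verb 分割する is cut into 分割 and する at the Kanji/Hiragana boundary, which a later, more detailed analysis would have to recognize.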
(2) Division by a word dictionary
Most of the currently published systems are of this type (Nagao et al., "Storage of a National Language Dictionary and Automatic Division of a Japanese Sentence", Information Processing, Vol. 19, No. 6, June 1978). In this technique, to improve the precision of division, it is always necessary to supplement the contents of the dictionary to suit the text to be parsed (mainly by adding new words) and to change the program to suit the algorithms by which words are applied.
In either case, the greatest disadvantage is that the dictionary and the algorithm depend on the field to which they are applied, so that maintenance of both the dictionary and the program never ends. There is also a method in which several kinds of dictionaries are prepared to eliminate the burden of changing the program, but this makes system maintenance difficult because of the complex interrelation of effects among the dictionaries.
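The dependence on dictionary contents can be seen in a minimal sketch. Greedy longest-match lookup is used here only as a stand-in, since the cited systems' actual algorithms are not described above; the function name `split_by_dictionary` and the toy dictionary are hypothetical.

```python
def split_by_dictionary(text: str, dictionary: set[str]) -> list[str]:
    """Greedy longest-match division; unknown characters become 1-char pieces."""
    max_word_len = max((len(word) for word in dictionary), default=1)
    pieces: list[str] = []
    pos = 0
    while pos < len(text):
        match = text[pos]  # fallback when no dictionary word starts here
        # Try the longest candidate first, down to length 2.
        for length in range(min(max_word_len, len(text) - pos), 1, -1):
            candidate = text[pos:pos + length]
            if candidate in dictionary:
                match = candidate
                break
        pieces.append(match)
        pos += len(match)
    return pieces


toy_dictionary = {"自然", "言語", "処理", "自然言語"}  # hypothetical entries
print(split_by_dictionary("自然言語処理", toy_dictionary))
# → ['自然言語', '処理']
print(split_by_dictionary("自然言語解析", toy_dictionary))
# → ['自然言語', '解', '析']
```

In the second call the word 解析 is not registered, so it falls apart into single characters until it is added to the dictionary, which is the kind of ongoing maintenance described above.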
(3) Division by the nature of Kanji
It is almost impossible to register all the words used in the Japanese language in a dictionary, but it may be possible to register most of the Kanji characters used. Noting this point, there is a technique for division which uses a dictionary in which the use and reach of each Kanji character in words are described in conjunction with the characters occurring before and after it (Takano, Araki, Kaneko, Hinatsu, "A Japanese Keyword Automatic Extraction System (JAKAS)", The Collection of The 18th Information Science and Technology Study Conference, pp. 35-44, 1981). Using this technique, the entries of the dictionary can certainly be reduced to a relatively small number. However, since the meaning possessed by each Kanji character is not as general as the part of speech of a word, the past accumulation of lexical knowledge, such as that in a dictionary of the Japanese language, cannot be directly utilized. Therefore, it is unclear whether the information in the dictionary works well for texts other than the titles of science and engineering literature attempted in the reference.
(4) Division by statistical information on character chains
This is a technique by which the immediately preceding method is implemented using a statistical approach (dynamic programming) (Fujisaki, "Unit Segmentation and Kana Allocation of a Writing in Kanji and Kana by Dynamic Programming", Information Processing NL Study, Natural Language 28-5, Nov. 20, 1981). Since the information to be held for each Kanji character is provided automatically (by using probability statistics) if a large quantity of text is available, it is unnecessary to spend much time maintaining the dictionary. However, at present there is the problem of how to collect a quantity of machine-readable text large enough to attain sufficient precision. This technique also has the drawback that it is difficult to predict which texts, and how many, must be collected to attain a given precision, and what kinds of errors are reduced as the precision of the dictionary increases.
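The general idea of combining character-chain statistics with dynamic programming can be sketched as follows. This is not the formulation of the cited reference, whose model is not described above. The sketch assumes a character-bigram chain with start and end markers, additive smoothing, and statistics gathered from a small list of already divided units; all names (`train_chain_counts`, `unit_log_prob`, `split_by_chain_statistics`) and the training data are hypothetical.

```python
import math

START, END = "^", "$"  # assumed unit-boundary markers


def train_chain_counts(units: list[str]) -> tuple[dict, int]:
    """Count character-to-character transitions inside known units."""
    counts: dict[str, dict[str, int]] = {}
    vocab = {END}
    for unit in units:
        vocab.update(unit)
        symbols = [START, *unit, END]
        for prev, nxt in zip(symbols, symbols[1:]):
            row = counts.setdefault(prev, {})
            row[nxt] = row.get(nxt, 0) + 1
    return counts, len(vocab)


def unit_log_prob(unit: str, counts: dict, vocab_size: int, alpha: float = 0.1) -> float:
    """Log-probability of one candidate unit under the smoothed chain model."""
    symbols = [START, *unit, END]
    total = 0.0
    for prev, nxt in zip(symbols, symbols[1:]):
        row = counts.get(prev, {})
        seen = row.get(nxt, 0)
        context = sum(row.values())
        total += math.log((seen + alpha) / (context + alpha * vocab_size))
    return total


def split_by_chain_statistics(text: str, counts: dict, vocab_size: int,
                              max_unit_len: int = 8) -> list[str]:
    """Dynamic programming over end positions: keep the best-scoring division."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start of the last unit in that division
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_unit_len), end):
            score = best[start] + unit_log_prob(text[start:end], counts, vocab_size)
            if score > best[end]:
                best[end], back[end] = score, start
    pieces = []
    end = n
    while end > 0:  # follow the back-pointers from the end of the text
        pieces.append(text[back[end]:end])
        end = back[end]
    return pieces[::-1]


counts, vocab_size = train_chain_counts(["自然", "言語", "処理", "する"])
print(split_by_chain_statistics("自然言語", counts, vocab_size))
```

With these toy statistics the division 自然 / 言語 scores highest, because the chain 然→言 was never observed inside a unit. With so little training data the result is fragile, which mirrors the drawback noted above: the attainable precision depends on how much text is collected.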