1. Field of the Invention
The present invention relates to a method of segmenting a text into words, and more particularly to a method of automatically segmenting a text into words which is suitable for use in a Japanese language processing system for processing an input text which may contain unknown words.
2. Description of the Prior Art
In various kinds of natural language processing systems, including machine translation systems, a dictionary indexed by word is used, and a text is processed while the dictionary is searched (refer to, for example, Japanese unexamined patent application publication No. Sho 56-17467). However, it is impossible to register in the dictionary, in advance, all words which may appear in the text, and hence how an unknown word is to be treated is an important practical problem. That is, it is required to efficiently identify an unknown word contained in the text. Although an unknown word can be readily identified in a language having a space between words, such as the English language, it is very difficult to identify an unknown word in an agglutinative language such as the Japanese language.
In order to automatically process a text written in a language having no space between words, such as the Japanese language, it is required to segment the text into words at the first stage. For this purpose, a method has been widely used in which the text is scanned from the head, a word dictionary is searched by using a character string in the text as an index, and it is checked, for example on the basis of the part of speech, whether a word can be connected to the preceding word. In this method, the segmentation of the text into words reaches a deadlock (that is, the dictionary search fails or no word capable of being connected with the preceding word can be found) for one of two reasons: the presence of an unknown word, or an erroneous segmentation made earlier in the text. Accordingly, when the segmentation reaches a deadlock, it cannot be immediately judged that an unknown word is contained, and backtracking processing must be carried out to try other segmenting possibilities. That is, where the deadlock is caused by an unknown word, the presence of the unknown word can be judged only when the backtracking processing has reached the head of the text, so that it takes a long time to make this judgment. Furthermore, even when it is judged that an unknown word is contained in the text, it is not easy to restart segmentation processing from the position of the unknown word. In more detail, since the backtracking processing is carried out without taking the presence of an unknown word into consideration, and information on segmentation results that failed halfway is not preserved, the position of the unknown word cannot be identified at the time it is judged that an unknown word is contained in the text.
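The conventional method described above can be illustrated by the following minimal sketch. It is not taken from any cited specification; the toy dictionary, the part-of-speech connection table, the maximum word length, and the function name `segment` are all assumptions introduced only for illustration.

```python
from typing import List, Optional

# Hypothetical toy dictionary: word -> part of speech (assumption for illustration).
DICTIONARY = {
    "私": "noun",
    "は": "particle",
    "本": "noun",
    "を": "particle",
    "読む": "verb",
}

# Hypothetical connection table: (preceding part of speech, following part of speech).
CONNECTABLE = {
    ("START", "noun"),
    ("noun", "particle"),
    ("particle", "noun"),
    ("particle", "verb"),
}

MAX_WORD_LEN = 2  # longest entry in the toy dictionary


def segment(text: str, pos: int = 0, prev: str = "START") -> Optional[List[str]]:
    """Scan from the head, trying longer dictionary entries first.

    Returns the list of words, or None when every possibility from `pos`
    reaches a deadlock, in which case the caller backtracks.
    """
    if pos == len(text):
        return []
    for length in range(min(MAX_WORD_LEN, len(text) - pos), 0, -1):
        candidate = text[pos:pos + length]
        part = DICTIONARY.get(candidate)
        # Deadlock causes: dictionary search fails, or the word cannot be connected.
        if part is None or (prev, part) not in CONNECTABLE:
            continue
        rest = segment(text, pos + length, part)
        if rest is not None:
            return [candidate] + rest
        # Otherwise fall through and try a shorter candidate (backtracking).
    return None


known = segment("私は本を読む")
unknown = segment("私はグフを読む")  # "グフ" is not in the toy dictionary
```

In this sketch `known` is the full list of words, while `unknown` is `None`. The failure is reported only after the recursion has unwound all the way to the head of the text, and the return value carries no information about where the unregistered string was encountered, which corresponds to the two drawbacks noted above.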
As mentioned above, in various kinds of natural language processing systems it is impossible in practice to register all words in a dictionary, and hence it is necessary to allow for the presence of an unknown word in the input text. For example, in a machine translation system, it is preferred that a text containing an unknown word not be treated as untranslatable, but that a translation be provided in which the unknown word is left as written in the source language.