In order to perform speech recognition and speech synthesis on a language, such as Japanese, in which boundaries between words are not explicitly expressed, it is desirable that a text be correctly segmented into words. Additionally, in order to achieve highly accurate segmentation, it is desirable that various words be registered in advance, in association with the appearance frequencies of the respective words, in a dictionary of the segmentation device that divides a text into words. Conventionally, a training text in which the boundaries between words are made clear has been required in order to register a sufficient number of words. However, such a training text needs to be constructed manually, and it has been difficult to secure training text in sufficient volume.
On the other hand, techniques for enabling judgment on boundaries between words without having a training text in sufficient volume have been proposed. In one of these techniques, statistical information, such as the frequency at which a certain character and another character are written consecutively in a word and the number of characters in a word, is computed in advance from a training text, and the statistical information is used for making a determination on a word unregistered in a dictionary (refer to Mori et al., “An Estimate of an Upper Bound for the Entropy of Japanese,” Journal of Information Processing Society of Japan, Vol. 38, No. 11, pp. 2191-2199 (1997); Nagata, “A Japanese Morphological Analysis Method Using a Statistical Language Model and an N-best Search Algorithm,” Journal of Information Processing Society of Japan, Vol. 40, No. 9, pp. 3420-3431 (1999); Itoh et al., “A Method for Segmenting Japanese Text into Words by Using N-gram Model,” Research Report of Information Processing Society of Japan, NL-122 (1997); Uchimoto et al., “Morphological Analysis Based on A Maximum Entropy Model: An Approach to The Unknown Word Problem,” Natural Language Processing, Vol. 8, No. 1, pp. 127-141 (2001); Asahara and Matsumoto, “Unknown Word Identification in Japanese Text Based on Morphological Analysis and Chunking,” Research Report of Information Processing Society of Japan, NL154-8, pp. 47-54 (2003)). In another proposed technique, an index value is computed that indicates the likelihood that a certain inputted character string is a word (refer to Mori and Nagao, “Unknown Word Extraction from Corpora Using n-gram Statistics,” Journal of Information Processing Society of Japan, Vol. 39, No. 7, pp. 2093-2100 (1998); Yamamoto, M., and Church, K. W., “Using Suffix Arrays to Compute Term Frequency and Document Frequency for all Substrings in a Corpus,” Computational Linguistics, Vol. 27, No. 1, pp. 1-30 (2001)).
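By way of illustration only, the first kind of technique may be sketched as follows. The sketch is not the method of any reference cited above. It assumes a character bigram model with add-alpha smoothing, and the boundary markers, the function names `train_char_bigrams` and `word_log_prob`, and the six-word training list are all hypothetical. Consecutive-character frequencies are counted from words whose boundaries are known, and the counts are then used to score a character string that is not in the dictionary.

```python
import math
from collections import Counter

# Hypothetical word-boundary markers (assumed for this sketch only).
BOS, EOS = "<w>", "</w>"


def train_char_bigrams(words):
    """Count how often one character follows another inside known words."""
    bigrams, contexts = Counter(), Counter()
    for word in words:
        chars = [BOS] + list(word) + [EOS]
        for prev, cur in zip(chars, chars[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts


def word_log_prob(candidate, bigrams, contexts, vocab_size, alpha=1.0):
    """Log-probability that `candidate` is a word, with add-alpha smoothing."""
    chars = [BOS] + list(candidate) + [EOS]
    logp = 0.0
    for prev, cur in zip(chars, chars[1:]):
        numerator = bigrams[(prev, cur)] + alpha
        denominator = contexts[prev] + alpha * vocab_size
        logp += math.log(numerator / denominator)
    return logp


# Toy training data, invented for illustration; a real system would use
# a manually segmented training text.
training_words = ["東京", "京都", "東京都", "大阪", "大学", "東大"]
bigrams, contexts = train_char_bigrams(training_words)

# Possible next symbols: every observed character plus the end marker.
vocab_size = len({c for w in training_words for c in w}) + 1

plausible = word_log_prob("東京", bigrams, contexts, vocab_size)
implausible = word_log_prob("阪東", bigrams, contexts, vocab_size)
print(plausible > implausible)
```

With this toy data, a string whose character sequence resembles the training words receives a higher score than a string of the same length whose character transitions were never observed. The quality of such a score depends on how much segmented training text is available.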
However, in any one of these techniques, a contradiction sometimes arises: in order to make a highly accurate determination on a word unregistered in a dictionary, sufficient information on the properties of that very word is required. Additionally, in a case where the information available on unregistered words is fixed, there is a tradeoff between the accuracy of detection and the number of words detectable as unregistered words. That is, the accuracy (precision) becomes more likely to decrease as an increase in the number of detected words, that is to say, in the recall ratio, is attempted.
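The tradeoff described above may be illustrated with a minimal sketch under stated assumptions. It is assumed here that each candidate string receives a score and is detected as a word when the score is at or above a threshold. The function name `precision_recall`, the scores, and the correct/incorrect labels are hypothetical values invented for illustration and are not measured results.

```python
def precision_recall(scored, threshold):
    """Return (precision, recall) when candidates scoring >= threshold are detected."""
    detected = [is_word for score, is_word in scored if score >= threshold]
    true_positives = sum(detected)
    total_true = sum(is_word for _, is_word in scored)
    # Convention assumed here: precision is 1.0 when nothing is detected.
    precision = true_positives / len(detected) if detected else 1.0
    recall = true_positives / total_true if total_true else 0.0
    return precision, recall


# (score, is the candidate really a word?) pairs; illustrative values only.
candidates = [
    (0.9, True), (0.8, True), (0.7, False), (0.6, True),
    (0.4, False), (0.3, True), (0.2, False),
]

strict = precision_recall(candidates, threshold=0.75)
loose = precision_recall(candidates, threshold=0.25)
print("strict:", strict)
print("loose:", loose)
```

Because the scores themselves do not change, lowering the threshold detects more of the true words (higher recall) but also admits more false detections (lower precision).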