Along with the progress of recognition technology for natural language, various techniques, including kana-kanji conversion, spelling checking (character error correction), OCR, and speech recognition, have achieved a practical-level prediction capability. At present, most high-accuracy implementations of these techniques are based on probabilistic and/or statistical language models. Probabilistic language models are based on the frequency of occurrence of words or characters and require a collection of a huge number of texts (a corpus) in the application field.
The following documents are considered:
[Non-patent Document 1] Hozumi Tanaka (ed.), "Natural Language Processing: Fundamentals and Applications", Institute of Electronics, Information and Communication Engineers, 1999.
[Non-patent Document 2] W. J. Teahan and John G. Cleary, "The Entropy of English Using PPM-Based Models", In DCC, 1996.
[Non-patent Document 3] Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone, Classification and Regression Trees, Chapman & Hall, Inc., 1984.
[Non-patent Document 4] Masaaki Nagata, "A Self-Organizing Japanese Word Segmenter using Heuristic Word Identification and Re-estimation", 1997.
In most speech recognition systems, the most probable character string is selected from among a number of candidates by referring to a probabilistic language model as well as an acoustic model. In spell checking (character error correction), unnatural character strings and their correction candidates are listed based on the likelihood given by a probabilistic language model.
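The selection described above can be sketched as follows. This is a minimal illustration only: the candidate strings and all score values are hypothetical, and a real recognizer would compute them from trained acoustic and language models.

```python
import math

def best_candidate(candidates, acoustic_logprob, lm_logprob, lm_weight=1.0):
    """Select the candidate string maximizing the combined
    acoustic-model and language-model log score."""
    return max(
        candidates,
        key=lambda c: acoustic_logprob[c] + lm_weight * lm_logprob[c],
    )

# Illustrative scores for two homophonous candidates: the acoustic model
# cannot distinguish them, so the language model decides.
acoustic = {"candidate A": math.log(0.5), "candidate B": math.log(0.5)}
lm = {"candidate A": math.log(0.3), "candidate B": math.log(0.05)}

choice = best_candidate(acoustic, acoustic, lm)
```

Here the language model breaks the tie between acoustically identical candidates, which is exactly the role it plays in speech recognition and spell checking.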
Because a practical model treats a word as a unit, the corpus must be provided with information about word boundaries. Word boundaries are determined by operations such as segmentation or tagging.
Automatic word segmentation methods are already known. However, existing automatic word segmentation systems provide low accuracy in fields such as the medical field, where many technical terms are used. To manually correct the results of automatic word segmentation, the operator needs knowledge of the technical terms of the application field, and typically a minimum of tens of thousands of sentences are required to achieve recognition accurate enough for practical use.
In training with a corpus in an application field, it is generally difficult to obtain a huge corpus that has been manually segmented and tagged for the field; doing so takes much time and cost, making it difficult to develop a system in a short period.
Although information segmented into words in one field (for example, the medical field) may work for processing the language of that field, there is no assurance that it will also work in another application field (for example, the economic field, which is completely different from the medical field). In other words, a corpus correctly segmented and tagged for one field may be definitely correct in that field but not necessarily correct in other fields, because the segmentation and/or tagging has been fixed.
In this regard, many techniques in the background art pursue efficiency and accuracy in word segmentation of Asian languages. However, all of these techniques aim to predetermine word boundaries fixedly.
Taking Japanese as an example of an Asian language, the word information required for analyzing Japanese text relates to the structure of word spelling, that is, information regarding the character configuration (representation form) and pronunciation of entry words, including "spelling information", "pronunciation information", and "morphological information". These items of information provide important clues, mainly for extracting candidate words from Japanese text in morphological analysis.
Although there is no clear definition of the term "word", attention is directed herein to two elements of a word, its "spelling" and its "pronunciation": two words are regarded as the same word if and only if they have the same spelling (characters) and the same pronunciation. Homographic words having only the same spelling (characters), or homophonic words having only the same pronunciation, are regarded as different words. The spelling of a word is involved in identifying its morphological characteristics, and the pronunciation is involved in identifying its phonemic characteristics.
Hence, the Japanese words composed of two Chinese characters 記者 (reporter), 汽車 (train), 帰社 (return to the office), and 喜捨 (charity) all have the same pronunciation きしゃ (kisha) but different spellings (characters), whereby they are different words. A "word" is symbolized in the computer, in which the correspondence between the symbol as the spelling (characters) and the symbol as its meaning is registered. Japanese is an agglutinative language with an extremely high word-formation power, whereby care must be taken in registering words in the computer as a "dictionary". The pronunciation is given as a string of input symbols (e.g., katakana in Japanese, or a Roman-character representation of katakana) in the computer.
A word is registered in the computer by one of the following methods, or a combination of them: registering all possible spellings (characters), or collecting and registering the spellings with high frequency of use; registering only typical spellings and searching for a word in combination with its pronunciation; or providing various character conversion tables apart from the dictionary and investigating the correspondence with headwords.
A plain example of correcting the result of automatic word segmentation is given below. For the pronunciation はきもの (ha-ki-mo-no), there are two corresponding spellings. One is the single word 履物 (footwear), and the other is a sequence of two words, は (postpositional particle) and 着物 (kimono). Both spellings are associated with the pronunciation "ha-ki-mo-no". If this pronunciation occurs and the spelling resulting from word segmentation is considered improper, the spelling is corrected by re-segmenting. Unlike English, Japanese is not written with spaces between words, and therefore an expert must determine word boundaries from the context around a sample sentence, based on knowledge of the technical terms.
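Choosing between such candidate segmentations can be sketched with a toy scoring model. This is a minimal illustration under assumed conditions: the word probabilities below are hypothetical, and a practical system would use a trained n-gram model rather than a hand-written unigram table.

```python
import math

# Toy unigram probabilities (illustrative values, not trained on a corpus).
UNIGRAM = {"履物": 0.004, "は": 0.05, "着物": 0.002}

def score(segmentation):
    """Log-probability of a word sequence under the toy unigram model."""
    return sum(math.log(UNIGRAM[w]) for w in segmentation)

# Two candidate segmentations of the pronunciation ha-ki-mo-no.
candidates = [["履物"], ["は", "着物"]]
best = max(candidates, key=score)
```

Under these assumed probabilities, the single-word reading scores higher than the two-word reading, mirroring the correction an expert would make.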
As the example of assigning the word 履物 (footwear) to the pronunciation はきもの (ha-ki-mo-no) indicates, words need to be recognized correctly using knowledge of the vocabulary. Therefore, there is a demand for a method that increases accuracy by making effective use of a corpus without segmentation.
For all processes in natural language processing, conversion of character strings or speech data into a string of morphemes is a prerequisite. However, in Asian languages such as Japanese, it is difficult to morphologically analyze even written text because, unlike English text, text in such languages is written without spaces between words. Therefore, as part of the accuracy problem described above, there is a need to accurately obtain candidate morpheme strings (x) when input data (y) such as a hiragana character string, a katakana character string, or speech data is given.
In a statistical approach, this can be formulated as the problem of maximizing P(x|y), and Bayes' theorem can be used to decompose it into the two models to be maximized, P(y|x) and P(x), as shown in the right-hand side of the equation

P(x|y) = P(y|x)P(x)/P(y),
where P(y) is constant since y is given. The model P(x) is independent of the type of input data (whether it is a symbol string, character string, or speech data) and is hence called a "language model". One of the most commonly used probabilistic language models is the word n-gram model.
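Estimating a word n-gram model from a segmented corpus can be sketched as follows. This is a minimal maximum-likelihood bigram estimator for illustration only; the sentence-boundary symbols and the toy corpus are assumptions, and a practical model would also apply smoothing.

```python
from collections import Counter

def bigram_probs(sentences):
    """Maximum-likelihood word bigram model P(w_i | w_{i-1})
    estimated from a corpus of pre-segmented sentences."""
    bigrams, unigrams = Counter(), Counter()
    for words in sentences:
        # Pad each sentence with start/end symbols.
        padded = ["<s>"] + list(words) + ["</s>"]
        unigrams.update(padded[:-1])           # history counts
        bigrams.update(zip(padded[:-1], padded[1:]))
    return {bg: c / unigrams[bg[0]] for bg, c in bigrams.items()}

corpus = [["a", "b"], ["a", "c"]]
probs = bigram_probs(corpus)
```

For the toy corpus, P(a | <s>) = 1.0 because both sentences begin with "a", and P(b | a) = 0.5 because "a" is followed by "b" in one of its two occurrences.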
<Conventional Art Relating to the Use of Unsegmented Corpus>
As conventional art, there are methods in which the result of segmenting an unsegmented corpus, based on training with a segmented corpus, is used:
(a) counting n-grams with weights over the candidate segmentations,
(b) using only the 1-best of the candidates resulting from automatic segmentation, and
(c) using the n-best of the candidates resulting from automatic segmentation.
However, methods (a) and (c) require high computational costs for bigrams and higher-order n-grams and are impractical. Advantages of the present invention over method (b) will be described later with respect to experiments.
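The weighted counting of method (a) can be sketched for the unigram case as follows. This is an illustrative fragment only: the candidate segmentations and their probabilities are hypothetical, and the cost the text mentions arises when the same accumulation is extended to bigrams and higher over many candidates.

```python
from collections import Counter

def weighted_unigram_counts(candidates):
    """Accumulate unigram counts in which each candidate segmentation
    contributes in proportion to its probability, as in method (a)."""
    counts = Counter()
    for segmentation, prob in candidates:
        for word in segmentation:
            counts[word] += prob
    return counts

# Two hypothetical candidate segmentations of one sentence, with
# illustrative probabilities 0.7 and 0.3.
cands = [(("履物",), 0.7), (("は", "着物"), 0.3)]
counts = weighted_unigram_counts(cands)
```

Each word thus receives a fractional count equal to the total probability of the candidate segmentations containing it, rather than a count of 0 or 1 as in the 1-best method (b).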