1. Field Of The Invention
The present invention relates to a speech recognition apparatus.
2. Description Of The Related Art
Speaker-independent speech recognition technology has advanced to the point where a speech recognition apparatus may have a large vocabulary of a hundred to a thousand words, rather than a small vocabulary of only ten or twenty words. In small-vocabulary word recognition, a large number of speakers pronounce all the words in the vocabulary during a reference speech training process, and each word is then treated as a single recognition unit. However, in large-vocabulary word recognition, or in continuous speech recognition for recognizing a number of continuously uttered words, it is difficult to carry out this kind of training usefully for every word in the vocabulary. Therefore, in the training process for large-vocabulary recognition and continuous speech recognition, the vocabulary, i.e., words, connected words or sentences, is divided into smaller segments, e.g., syllables or phonemes, and training is carried out with each such segment as the recognition unit.
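Dividing the vocabulary into smaller units means that a word can be assembled from units learned from other words. The following is only a minimal sketch of that idea, not the method of the cited art: the romanized lexicon, the training word list and the function names `trained_units` and `recognizable_words` are all hypothetical and chosen for illustration.

```python
# Hypothetical example: each word is written as a sequence of syllable units.
def trained_units(training_words):
    """Collect every syllable unit that appears in the training utterances."""
    units = set()
    for syllables in training_words:
        units.update(syllables)
    return units


def recognizable_words(lexicon, units):
    """Return the words whose syllables were all covered during training."""
    return [word for word, syllables in lexicon.items()
            if all(s in units for s in syllables)]


# Illustrative training set: three uttered words, six distinct syllables.
training_words = [["e", "ki"], ["ka", "ku"], ["ni", "a"]]

# Illustrative lexicon: "kani" and "aki" were never uttered during training.
lexicon = {
    "eki": ["e", "ki"],
    "kaku": ["ka", "ku"],
    "kani": ["ka", "ni"],
    "aki": ["a", "ki"],
    "kyo": ["kyo"],
}

units = trained_units(training_words)
covered = recognizable_words(lexicon, units)
```

In this sketch, "kani" and "aki" are covered although neither was uttered during training, while "kyo" is not, because its only syllable never appeared in the training set.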
At recognition, a method is adopted whereby the recognition result for a word is obtained by successively connecting the recognition results for the individual recognition units. In this way, it is not necessary to utter every word during training; training is possible by uttering only a word set that includes each recognition unit at least once. This is laid out in detail in the following documents.
"Recognition of Spoken Words Based on VCV Syllable Unit", Ryohei Nakatsu, Masaki Kohda, The Institute of Electronics, Information and Communication Engineers of Japan Journal, Vol. J61-A, No. 5, pp. 464-471 (1978.5); and "Syntax Oriented Spoken Japanese Understanding System", Yoshihisa Ohguro, Yasuhide Hashimoto, Seiichi Nakagawa, Electro-communications society technical journal, SP88-87, pp. 55-62 (1988).
However, in the examples in the aforementioned prior art documents, because the reference speech corresponds to syllable units rather than word units, information about the relationship between the syllables within a word is not reflected in the syllable reference speech, whereas it is reflected in word reference speech. For example, information about syllable duration does not reflect the relationship between the syllables. Therefore, even if the matching periods for the syllables within the same word are uneven and unnatural, an erroneous recognition result is output as long as the distance value is small.
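The failure mode described above can be illustrated with a small sketch. It assumes that each candidate is a list of (syllable, distance, duration in frames) tuples, and all numbers are made up for illustration. The measure `duration_spread` (longest duration divided by the median duration) is a hypothetical indicator of unevenness only; it is not taken from the cited documents and is not the method of the present invention.

```python
from statistics import median


def total_distance(segments):
    """Sum of per-syllable matching distances, the only score used here."""
    return sum(dist for _, dist, _ in segments)


def duration_spread(segments):
    """Hypothetical unevenness measure: longest duration over median duration."""
    durations = [dur for _, _, dur in segments]
    return max(durations) / median(durations)


def pick_by_distance(candidates):
    """Distance-only selection: durations are ignored entirely."""
    return min(candidates, key=total_distance)


# Illustrative values: (syllable, distance, duration in frames).
# Candidate A has the smaller total distance but one very long "e".
candidate_a = [("e", 1.0, 10), ("ki", 1.2, 9), ("wa", 0.9, 11), ("e", 0.8, 40)]
candidate_b = [("e", 1.1, 10), ("ki", 1.2, 9), ("no", 1.3, 11), ("ti", 1.0, 10)]

chosen = pick_by_distance([candidate_a, candidate_b])
```

With these values, selection by distance alone returns `candidate_a`, even though its last syllable is several times longer than the others, whereas `candidate_b` has nearly even durations.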
In FIG. 5, (a) represents voice waveforms and (b) represents an incorrect recognition result "kyo"ba"si"e"ki"wa"tu"ka"e"ma"su"ka", which is output by the conventional recognition method using the syllable recognition unit. As mentioned above, the conventional method does not use the relationship between syllable durations; therefore, although the second "e" has an unnaturally long duration, this result is output as a correct recognition.
Meanwhile, (c) represents the correct recognition result "kyo"ba"si"e"ki"no"ti"ka"ku"ni"a"ri"ma"su"ka", which is output by the recognition method of the present invention using the syllable recognition unit and the duration relationship, as described below. There is no unnatural duration and no unnatural syllable matching period.