The present invention generally relates to an improvement in a speech recognizing device and, more particularly, to a speech recognizing device capable of accurately recognizing speech which is uttered in a noisy environment.
In a prior art speech recognizing device, word recognition is usually implemented by the DP matching principle, as taught by Sakoe and Chiba in a paper entitled "Continuous Word Recognition Based on Time Normalization of Speech Using Dynamic Programming", the Institute of Acoustic Engineers of Japan, Transactions, 27, 9, pp. 483-500 (1971) (hereinafter referred to as reference 1). A problem heretofore pointed out with this kind of scheme is that the recognition accuracy for noisy speech is lower than that for speech spoken against a quiet background. This is ascribable to the fact that speech in a noisy environment is not only masked by additive noise but also deformed in the spectrum of the utterance itself. The deformation is in turn ascribable to the general tendency of a speaker to speak louder and more clearly in a noisy environment because the speaker cannot hear his or her own utterance clearly. For example, the spectra of a certain vowel spoken in quiet and noisy environments by the same male speaker show that the utterance in the noisy environment not only has greater overall energy but also differs in spectral contour, formant positions and bandwidth. In general, such a change is observed with all vowels. In this manner, the spectrum of even the same vowel differs noticeably between a quiet background and a noisy background, resulting in a substantial distance between vowel patterns and therefore in recognition errors.
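For illustration only, the DP matching referred to above may be sketched as follows. This is a minimal sketch rather than the procedure of reference 1: the Euclidean frame distance, the simple three-way step pattern, and the names `frame_distance`, `dp_match` and `recognize` are all assumptions made for the example, and no slope constraint or path-length normalization is applied.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature vectors (assumed local distance)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dp_match(input_pattern, reference_pattern):
    """Accumulated distance along the best time-warping path between two patterns.

    Each pattern is a list of feature vectors, one vector per frame.
    """
    n, m = len(input_pattern), len(reference_pattern)
    inf = float("inf")
    # g[i][j] holds the smallest accumulated distance that aligns the first
    # i input frames with the first j reference frames.
    g = [[inf] * (m + 1) for _ in range(n + 1)]
    g[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(input_pattern[i - 1], reference_pattern[j - 1])
            # Assumed step pattern: vertical, horizontal, or diagonal move.
            g[i][j] = d + min(g[i - 1][j], g[i][j - 1], g[i - 1][j - 1])
    return g[n][m]


def recognize(input_pattern, references):
    """Return the label of the registered pattern with the smallest DP distance."""
    return min(references, key=lambda label: dp_match(input_pattern, references[label]))
```

In such a scheme, any deformation of the input spectrum raises the frame distances along every warping path, so a pattern registered in a quiet environment can end up farther from the noisy input than a wrong word is.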
Several different approaches are available for recognizing speech in noisy environments. For example, it is known that the recognition accuracy for noisy speech increases as the environment at the time of recognition and the environment at the time of registration become closer to each other, as C. H. Lee and K. Ganesan teach in "Speech Recognition Under Additive Noise", ICASSP 1984, 35.7 (1987.3) (hereinafter referred to as reference 2). A method which may be derived from this finding is to register standard patterns uttered in a number of different environments beforehand (hereinafter referred to as method 1). It was also reported by Umezaki and Itakura, in "Comparison and Evaluation of the Distance Measures by Weighted FFT Cepstrum and Smoothed Group Delay Spectrum Coefficients", the Institute of Acoustic Engineers of Japan, Manuscript Collection, 1-5-11, Aug. 1987 (hereinafter referred to as reference 3), that a method using a weighted Cepstrum distance as a distance measure (hereinafter referred to as method 2) is advantageous for the recognition of noisy speech. Further, the above-mentioned spectra suggest that the spectrum deformation is significant in the frequency range above 2.5 kilohertz but insignificant in the frequency range below 2.5 kilohertz. This tendency holds true for other vowels also. In light of this, speech may be recognized by using the characteristics of the spectrum in the frequency range lower than 2.5 kilohertz (hereinafter referred to as method 3).
However, method 1 cannot cope with the spectrum fluctuation of speech in a noisy environment without increasing the time and labor necessary for registration, the amount of storage, and the amount of processing to prohibitive degrees. Method 2 is advantageously applicable to additive white noise and the like because the weighted Cepstrum distance places much weight on the formant peaks. However, method 2 is susceptible to changes in the formant positions and bandwidth and therefore cannot deal with the above-discussed spectrum fluctuation. Further, method 3 is apt to degrade rather than improve the recognition accuracy because it cannot readily distinguish fricatives, plosives and other consonants from one another.
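For illustration only, the distance measure of method 2 and the band limitation of method 3 may be sketched as follows. This is a hedged sketch under stated assumptions: the weights passed to `weighted_cepstrum_distance` are hypothetical and are not the weights of reference 3, and `low_band` assumes that the spectrum is a list of magnitudes evenly spaced from 0 hertz to half the sampling frequency. Both function names are introduced here for the example.

```python
def weighted_cepstrum_distance(c1, c2, weights):
    """Weighted squared distance between two Cepstrum vectors.

    The weights are supplied by the caller; the values used in reference 3
    are not reproduced here.
    """
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, c1, c2))


def low_band(spectrum, sample_rate_hz, cutoff_hz=2500.0):
    """Keep only the spectrum bins whose frequency is lower than cutoff_hz.

    Assumes at least two bins, evenly spaced from 0 Hz to sample_rate_hz / 2.
    """
    bin_width_hz = (sample_rate_hz / 2.0) / (len(spectrum) - 1)
    return [v for k, v in enumerate(spectrum) if k * bin_width_hz < cutoff_hz]
```

With nine bins and an assumed sampling frequency of 8 kilohertz, the bins lie 500 hertz apart, so `low_band` keeps the five bins from 0 to 2 kilohertz. The discarded upper band is where much of the energy of fricatives and plosives tends to lie, which is consistent with the drawback of method 3 noted above.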