1. Field of the Invention
The present invention relates to a noise adaptive speech recognition method and apparatus.
2. Description of the Related Art
In order to recognize speech in a noisy environment, various speech recognition methods have been suggested.
A first prior art speech recognition method is based on a noise superposition learning technology, which superposes noise data onto a standard speech signal to prepare noise adaptive standard speech data. This is possible because the noise data is known to some extent in a noisy environment during speech recognition. Thus, the noise condition of the learning environment is brought close to that of the recognition environment, to improve the performance of speech recognition in a noisy environment (see JP-A-9-198079).
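The noise superposition step can be illustrated with a minimal sketch. It assumes that the speech and noise are available as lists of samples and that the noise is scaled to reach a chosen SNR before being added; the function names `power`, `snr_db` and `superpose_noise` and the scaling rule are illustrative assumptions, not details taken from JP-A-9-198079.

```python
import math


def power(samples):
    """Average power (mean of squared samples) of a signal."""
    return sum(v * v for v in samples) / len(samples)


def snr_db(speech, noise):
    """Signal-to-noise ratio in dB between a speech and a noise signal."""
    return 10.0 * math.log10(power(speech) / power(noise))


def superpose_noise(speech, noise, target_snr_db):
    """Scale the noise to the target SNR and add it to the speech.

    Returns (noisy_speech, scaled_noise). Hypothetical helper: the cited
    reference does not specify how the noise level is chosen.
    """
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    p_speech, p_noise = power(speech), power(noise)
    if p_speech == 0.0 or p_noise == 0.0:
        raise ValueError("speech and noise must have non-zero power")
    # Choose the gain so that p_speech / (gain^2 * p_noise) = 10^(SNR/10).
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (target_snr_db / 10.0)))
    scaled = [gain * n for n in noise]
    noisy = [s + n for s, n in zip(speech, scaled)]
    return noisy, scaled
```

In such a sketch, the noisy output would serve as the noise adaptive standard speech data for one fixed SNR, which is the property the following paragraph identifies as the weakness of this method.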
In the above-described first prior art speech recognition method, however, even when the noisy environment is known in advance, the voice level of a speaker, the distance between the speaker and a microphone, the volume gain of an apparatus, the noise level and the like fluctuate over time, so that the signal-to-noise ratio (SNR) also fluctuates over time, independently of the known noisy environment. Since the correlation between the SNR and the performance of speech recognition is very high, the fluctuation of the SNR would invalidate the noise learning effect of the above-described first prior art speech recognition method.
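The sensitivity of the SNR to these factors follows from its standard definition in decibels. In the relation below, the symbols $P_s$ and $P_n$ (average speech and noise power) and the amplitude factor $g$ are introduced here for illustration only and do not appear in the cited reference.

```latex
\mathrm{SNR} = 10 \log_{10} \frac{P_s}{P_n} \ \text{[dB]},
\qquad
\mathrm{SNR}' = 10 \log_{10} \frac{g^{2} P_s}{P_n} = \mathrm{SNR} + 20 \log_{10} g
```

For example, if the speech amplitude at the microphone is halved ($g = 1/2$) while the noise is unchanged, the SNR drops by about 6 dB, even though the noisy environment itself is the same.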
In a second prior art speech recognition method, two standard feature vector series having different SNRs, such as 0 dB and 40 dB, are prepared in advance for each category such as “hakata”. Then, a distance between a segment linking two corresponding standard feature vectors of the vector series and a feature vector of an input speech signal is calculated at each point of a two-dimensional grid formed by the input speech signal. Finally, a minimum value of the accumulated distances for the categories is found, and the category having this minimum accumulated distance is determined as the recognition result. Therefore, the standard feature vectors can be adapted to any value of SNR between 0 dB and 40 dB, so as to obtain high performance speech recognition (see JP-A-10-133688 and U.S. Pat. No. 5,953,699).
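The per-grid-point calculation described above can be sketched as a point-to-segment distance. The sketch assumes a Euclidean distance and clamps the closest point to the segment; both choices, and the name `segment_distance`, are assumptions made for illustration and are not confirmed by the cited references.

```python
import math


def segment_distance(x, a, b):
    """Distance from input feature vector x to the segment linking a and b.

    a and b stand for the corresponding standard feature vectors of the
    two series (for example, the 0 dB and 40 dB versions). Returns
    (distance, t), where t in [0, 1] locates the closest point a + t*(b - a).
    """
    direction = [bi - ai for ai, bi in zip(a, b)]
    denom = sum(d * d for d in direction)
    if denom == 0.0:
        # Degenerate segment: both standard vectors coincide.
        t = 0.0
    else:
        # Orthogonal projection of x onto the line through a and b ...
        t = sum((xi - ai) * d for xi, ai, d in zip(x, a, direction)) / denom
        # ... clamped so that the closest point stays on the segment.
        t = min(1.0, max(0.0, t))
    closest = [ai + t * d for ai, d in zip(a, direction)]
    return math.dist(x, closest), t
```

Under these assumptions, `t` plays the role of an interpolation weight between the two SNR conditions, and a recognizer of this kind would evaluate the function once per grid point before accumulating the distances for each category.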
In the above-described second prior art speech recognition method, however, the above-mentioned distances must be calculated at each grid point from three points, i.e., two standard feature vectors and one feature vector of the input speech signal. The resulting amount of calculation is enormous, which would increase the manufacturing cost of a speech recognition apparatus.
Also, in the above-described second prior art speech recognition method, since individual optimization is carried out at each grid point, when a range of corresponding standard feature vectors of the two feature vector series for one grid point interferes with a range of standard feature vectors for an adjacent grid point, an overmatching or mismatching operation would be carried out. That is, the power of a consonant is relatively small compared with that of a vowel. Therefore, when the SNR is too low, the power of the noise becomes equivalent to the power of the consonants. For example, the first consonant “h” of the category “hakata” is buried in the noise, so that the category “hakata” is deformed to a category “akata”. At worst, all the consonants “h”, “k” and “t” of the category “hakata” are buried in the noise, so that the category “hakata” is deformed to a category “aaa”.