The present invention relates to a speech recognition apparatus and a speech recognition method that adapt to both noise and the speaker.
The main problems in automatic speech recognition lie in background noise added to the speech to be recognized and in individual variation caused by the vocal organs or utterance habits of an individual speaker.
In order to achieve robust speech recognition capable of coping with these problems, a speech recognition method called HMM (Hidden Markov Model) composition, also called PMC (Parallel Model Combination), has been studied (for example, see pages 553-556 of IEEE ICASSP 1998, “Improved Robustness for Speech Recognition Under Noisy Conditions Using Correlated Parallel Model Combination”).
In a pre-processing stage before actual speech recognition is performed, the HMM composition method (PMC method) generates noise adaptive acoustic models (noise adaptive acoustic HMMs), that is, noise adaptive composite acoustic models, by composing standard initial acoustic models (initial acoustic HMMs) with a noise model (an HMM of the speaker's environmental noise) generated from the background noise.
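By way of illustration only, the composition of one speech mean vector with one noise mean vector may be sketched as follows. The sketch assumes the widely used log-add approximation of PMC, which combines only the mean vectors and leaves the variances untouched; it is not taken from the cited reference, which uses a correlated variant. The function names `dct_matrix` and `compose_mean` and the parameter `gain` are hypothetical, and the cepstrum is assumed to be an untruncated orthonormal DCT of the log spectrum.

```python
import math

def dct_matrix(n):
    """Orthonormal DCT-II matrix (n x n); its transpose is its inverse."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (i + 0.5) * k / n)
                     for i in range(n)])
    return rows

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def compose_mean(speech_cep, noise_cep, gain=1.0):
    """Log-add approximation: combine cepstral means of speech and noise.

    Assumption: both vectors are full (untruncated) cepstra of equal length,
    and `gain` is a hypothetical level-matching factor applied to the noise.
    """
    n = len(speech_cep)
    c = dct_matrix(n)
    c_inv = transpose(c)
    # Cepstrum -> log spectrum.
    speech_log = matvec(c_inv, speech_cep)
    noise_log = matvec(c_inv, noise_cep)
    # Speech and noise are additive in the linear spectral domain.
    noisy_log = [math.log(math.exp(s) + gain * math.exp(z))
                 for s, z in zip(speech_log, noise_log)]
    # Log spectrum -> cepstrum.
    return matvec(c, noisy_log)
```

In a full system this operation would be repeated for every Gaussian of every state of every initial acoustic model, which indicates why the amount of processing grows with the number of models.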
In the actual speech recognition stage, the likelihood of each noise adaptive acoustic model generated in the pre-processing is evaluated against a feature vector series obtained by cepstrum transformation of the uttered speech, which includes the additive background noise, and the noise adaptive acoustic model with the maximum likelihood is output as the result of speech recognition.
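The maximum likelihood selection described above may be sketched, purely as an example, as follows. The sketch assumes that each model is an HMM with one diagonal-covariance Gaussian per state, that every model starts in state 0, and that the Viterbi (best path) score is used in place of the full likelihood. The dictionary layout and the names `log_gauss`, `viterbi_score` and `recognize` are hypothetical.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at frame x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def viterbi_score(frames, model):
    """Best-path log likelihood of `frames` under one HMM.

    Assumed layout: model["means"][j] and model["vars"][j] describe state j,
    model["log_trans"][i][j] is the log transition probability from i to j.
    """
    n = len(model["means"])
    score = [float("-inf")] * n
    score[0] = log_gauss(frames[0], model["means"][0], model["vars"][0])
    for x in frames[1:]:
        new_score = []
        for j in range(n):
            best_prev = max(score[i] + model["log_trans"][i][j]
                            for i in range(n))
            new_score.append(
                best_prev + log_gauss(x, model["means"][j], model["vars"][j]))
        score = new_score
    # Simplification: the path may end in any state.
    return max(score)

def recognize(frames, models):
    """Return the name of the model with the maximum score."""
    return max(models, key=lambda name: viterbi_score(frames, models[name]))
```

Here `frames` stands for the feature vector series of the noisy utterance and `models` for the set of noise adaptive acoustic models, keyed by the word or sub-word unit that each one represents.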
Various technologies for speaker adaptation have also been studied; for example, the MAP (maximum a posteriori) estimation method and the MLLR (maximum likelihood linear regression) method, which update the mean vectors and covariance matrices of a model, are known.
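For reference, the mean-vector updates of these two methods may be sketched as follows. The sketch covers the mean vector only. The MAP update uses the standard form in which a prior mean is interpolated with occupancy-weighted adaptation frames; the MLLR function merely applies an affine transform that is assumed to have been estimated elsewhere. The names `map_update_mean` and `mllr_transform_mean` and the parameters `tau`, `gammas`, `a` and `b` are hypothetical.

```python
def map_update_mean(prior_mean, frames, gammas, tau):
    """MAP re-estimate of a Gaussian mean.

    prior_mean: mean of the speaker-independent model.
    frames:     adaptation feature vectors from the speaker.
    gammas:     occupancy probability of this Gaussian for each frame.
    tau:        prior weight; a larger tau keeps the result nearer the prior.
    """
    occupancy = sum(gammas)
    dim = len(prior_mean)
    weighted_sum = [sum(g * x[d] for g, x in zip(gammas, frames))
                    for d in range(dim)]
    return [(tau * prior_mean[d] + weighted_sum[d]) / (tau + occupancy)
            for d in range(dim)]

def mllr_transform_mean(mean, a, b):
    """Apply an MLLR mean transform: new_mean = a * mean + b.

    Assumption: the matrix a and the bias b were estimated beforehand from
    adaptation data and are shared by a group of Gaussians.
    """
    return [sum(a_rc * m for a_rc, m in zip(row, mean)) + b_r
            for row, b_r in zip(a, b)]
```

With little adaptation data the MAP estimate stays near the prior mean, whereas a shared MLLR transform moves every Gaussian in its group at once, including those for which no adaptation frames were observed.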
Conventional speech recognition, however, has the problem that a large amount of processing is required to perform noise adaptation on all of the initial acoustic models in order to obtain the noise adaptive acoustic models (noise adaptive acoustic HMMs) to be compared with the feature vector series.
This large amount of processing is unacceptable when a high processing speed must be maintained, and it therefore prevents the number of initial acoustic models from being increased. The resulting shortage of initial acoustic models in turn hinders improvement of recognition performance. It should be noted that the efficiency of an environmental noise adaptation technology can be improved by using a clustering technique. However, it is difficult to apply well-known speaker adaptation technologies such as the MLLR method or the MAP estimation method directly to such an environmental noise adaptation technology; that is, making noise adaptation and speaker adaptation coexist has been a problem to be solved.