1. Field of the Invention
The present invention relates to pattern recognition that uses the Hidden Markov Model for expressing recognition targets such as speech, characters, figures, etc., and more particularly, to a scheme for model adaptation which is aimed at correcting a mismatch of a model due to a difference between the condition at the time of model creation and the condition at the time of model use, that is, at the time of recognition execution, and thereby improving the recognition performance.
Note that the present invention is generally applicable to various kinds of pattern recognition using the Hidden Markov Model (HMM), but the following description will be given for an exemplary case of speech recognition for the sake of clarity.
2. Description of the Background Art
In speech recognition, the input speech data is matched against the acoustic model (phoneme model, syllable model, word model, etc.) obtained from training speech data, and the likelihood is determined so as to obtain the recognition result. Here, the parameters of the model largely depend on the conditions (background noise, channel distortion, speaker, vocal tract length, etc.) under which the training speech data are recorded. Consequently, when the condition at the time of recording the training speech data is different from the condition at the time of actual recognition, there arises a mismatch between the input speech pattern and the model, which in turn causes a lowering of the recognition rate.
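The matching described above can be illustrated with a minimal sketch. It assumes one-dimensional features, single-Gaussian state outputs, and two hypothetical word models named "a" and "b"; none of these choices come from the present description. Each model is scored with the forward algorithm and the model with the highest likelihood is taken as the recognition result. A constant offset added to the input stands in for a change of recording condition and is enough to change the result.

```python
import math

NEG_INF = -math.inf

def log_gauss(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))

def forward_loglik(obs, log_init, log_trans, means, variances):
    """Log-likelihood of an observation sequence under one HMM."""
    n = len(means)
    alpha = [log_init[i] + log_gauss(obs[0], means[i], variances[i])
             for i in range(n)]
    for x in obs[1:]:
        alpha = [
            logsumexp([alpha[i] + log_trans[i][j] for i in range(n)])
            + log_gauss(x, means[j], variances[j])
            for j in range(n)
        ]
    return logsumexp(alpha)

def recognize(obs, models):
    """Return the name of the model with the highest likelihood."""
    return max(models, key=lambda name: forward_loglik(obs, *models[name]))

# Two hypothetical left-to-right models that differ only in their state means.
half = math.log(0.5)
log_init = [0.0, NEG_INF]
log_trans = [[half, half], [NEG_INF, 0.0]]
models = {
    "a": (log_init, log_trans, [0.0, 2.0], [1.0, 1.0]),
    "b": (log_init, log_trans, [3.0, 5.0], [1.0, 1.0]),
}

matched = [0.1, -0.2, 1.9, 2.1]           # recorded under the training condition
mismatched = [x + 3.0 for x in matched]   # same utterance, shifted condition

result_matched = recognize(matched, models)
result_mismatched = recognize(mismatched, models)
```

In this toy setting the same utterance is assigned to model "a" under the matched condition and to model "b" once the offset is applied, which is the kind of degradation that model adaptation is meant to correct.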
Such a lowering of the recognition rate due to a mismatch between the input speech data and the acoustic model can be prevented by re-creating the model by using speech data recorded under the same condition as that at the time of actual recognition. However, a model based on a statistical method such as the HMM requires an enormous amount of training speech data, so that the processing requires a considerable time (such as 100 hours, for example). For this reason, there is a need for an adaptation technique that can adapt a mismatching model to a model that completely matches the condition at the time of actual recognition, by using a smaller amount of training data and less processing time.
As an example of a condition change, there is a change of the background noise at the time of utterance. The recognition rate is lowered when the background noise at the time of recording the training speech data is different from the background noise at the time of actual recognition.
The conventionally known techniques for adaptation of the model with respect to the background noise include the HMM composition schemes such as PMC (see M. J. F. Gales et al.: "An Improved Approach to the Hidden Markov Model Decomposition of Speech and Noise", Proc. of ICASSP92, pp. 233-236, 1992, for example) and NOVO (see F. Martin et al.: "Recognition of Noisy Speech by using the Composition of Hidden Markov Models", Proc. of Acoustic Society of Japan Autumn 1992, pp. 65-66, for example). The HMM composition scheme is an adaptation technique in which the HMM trained by using clean speech without noise that was recorded in a soundproof room (which will be referred to as a clean speech HMM hereafter) is combined with the HMM trained by using only the background noise at the time of recognition (which will be referred to as a noise HMM hereafter), so as to obtain an HMM that matches the input speech by having the background noise at the time of recognition superposed therein. The use of the HMM composition scheme only requires the training of the noise HMM and the processing time for the model composition, so that it is possible to adapt the model in a relatively short time compared with a case of re-creating the model by using an enormous amount of speech data.
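The idea of superposing the noise on the clean speech model can be sketched as follows. This is a simplified illustration and not the full procedure of the cited references: it assumes that the state means are held in the log-spectral domain, that the noise HMM has a single state, and that the variances are left unchanged. Under these assumptions the speech and noise energies add in the linear spectral domain, so each composed mean is the logarithm of the sum of the exponentiated means (sometimes called a log-add approximation). The function name, the `gain` parameter, and the sample values are hypothetical.

```python
import math

def compose_log_spectral_means(clean_state_means, noise_mean, gain=1.0):
    """Superpose a single-state noise mean on every clean speech state mean.

    All means are log-spectral; the combination is done per channel as
    log(gain * exp(speech) + exp(noise)). Variances are not handled here.
    """
    composed = []
    for state_mean in clean_state_means:
        composed.append([
            math.log(gain * math.exp(s) + math.exp(n))
            for s, n in zip(state_mean, noise_mean)
        ])
    return composed

# Hypothetical two-state clean speech HMM with two spectral channels.
clean = [
    [math.log(3.0), math.log(8.0)],
    [math.log(1.0), math.log(0.5)],
]
noise = [math.log(1.0), math.log(2.0)]

# Linear-domain energies add, e.g. 3 + 1 = 4 in channel 0 of state 0.
noisy = compose_log_spectral_means(clean, noise)
```

Because only the noise mean has to be estimated and the clean speech means are reused, a composition of this kind avoids retraining on noisy speech, although in practice the conversion between cepstral and spectral domains and the treatment of variances add to the processing time.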
However, the conventional speech recognition has been associated with the problem that it is difficult to adapt the model in real time to a continuously changing condition, because a rather long noise recording time (15 seconds, for example) is required for the purpose of obtaining the training data for the noise HMM, and a rather long processing time (about 10 seconds) is required for the model composition.