One of the challenges in speech recognition processing, in which speech is recognized by means of a computer, has been to perform highly precise recognition even in an environment where a variety of noise sources exist. Heretofore, various methods have been proposed for performing speech recognition in such a noisy environment, including the spectral subtraction method, the HMM (hidden Markov model) composition method, the CDCN (codeword-dependent cepstral normalization) method, and the like.
Because these methods are aimed at recognizing speech, they basically identify the part of the speech signal that corresponds to noise after one utterance has been completed (or generated), and then perform the speech recognition while taking the identified noise part into account (or removing it).
For example, the HMM composition method synthesizes various HMMs of noise and speech to generate phoneme hidden Markov models (composite HMMs) into which noise elements are incorporated, and performs the speech recognition based on the composite HMM that has the highest likelihood with respect to the speech to be recognized, thus coping with the noise. Such a conventional HMM composition method selects the composite HMM with the highest likelihood for each utterance and adopts that composite HMM as the recognition result. In other words, a single noise HMM is selected for each utterance.
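The per-utterance selection described above can be illustrated with a minimal sketch. It assumes the composite HMMs have already been built and, for simplicity, treats them as small discrete HMMs over quantized frame symbols; the model parameters, the symbol alphabet, and the names `forward_log_likelihood` and `select_composite_hmm` are all hypothetical and are not taken from any particular prior-art system. The composition step itself (combining speech and noise HMM parameters) is not shown.

```python
import math


def _log(p):
    # Logarithm that maps a zero probability to negative infinity.
    return math.log(p) if p > 0.0 else -math.inf


def log_sum_exp(values):
    # Numerically stable log(sum(exp(v) for v in values)).
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def forward_log_likelihood(hmm, observations):
    """Return log P(observations | hmm) using the forward algorithm."""
    init, trans, emit = hmm["init"], hmm["trans"], hmm["emit"]
    n = len(init)
    alpha = [_log(init[s]) + _log(emit[s][observations[0]]) for s in range(n)]
    for obs in observations[1:]:
        alpha = [
            log_sum_exp([alpha[p] + _log(trans[p][s]) for p in range(n)])
            + _log(emit[s][obs])
            for s in range(n)
        ]
    return log_sum_exp(alpha)


def select_composite_hmm(composite_hmms, observations):
    """Pick the single (phoneme, noise) model that best explains the whole utterance."""
    return max(
        composite_hmms,
        key=lambda name: forward_log_likelihood(composite_hmms[name], observations),
    )


# Hypothetical composite HMMs: the same phoneme combined with two noise types.
# Symbols: 0 = speech-like frame, 1 = ambiguous frame, 2 = burst-noise frame.
composite_hmms = {
    ("a", "steady_noise"): {
        "init": [0.6, 0.4],
        "trans": [[0.7, 0.3], [0.4, 0.6]],
        "emit": [[0.8, 0.1, 0.1], [0.6, 0.3, 0.1]],
    },
    ("a", "sudden_noise"): {
        "init": [0.5, 0.5],
        "trans": [[0.5, 0.5], [0.5, 0.5]],
        "emit": [[0.1, 0.2, 0.7], [0.2, 0.2, 0.6]],
    },
}

# One utterance whose last frame is hit by a sudden noise burst (symbol 2).
utterance = [0, 0, 1, 0, 0, 2]
best = select_composite_hmm(composite_hmms, utterance)
print(best)
```

In this toy setup a single noise model is chosen for all six frames, so the burst in the final frame is scored under the same noise assumption as the rest of the utterance.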
Incidentally, in an environment where various noise sources exist, noise is generated in diverse ways, including noise that continues to be generated regularly, noise that is generated suddenly, and noise that is generated irregularly. The above-described technology for coping with noise in conventional speech recognition processing identifies the type of noise for each utterance. Therefore, the technology is sufficiently effective against noise that continues to be generated regularly, and can realize good speech recognition in that case.
However, noise that is generated suddenly or irregularly may occur in the middle of an utterance, and the conventional technology, which identifies the type of noise once for each utterance, cannot cope with noise that changes so rapidly. This has lowered the precision of the speech recognition.