Recent voice recognition mainly uses a method that performs pattern matching based on statistical techniques, with a frequency pattern of the input voice as a feature vector. In this method, an acoustic model is obtained in advance by modeling the statistical features of feature vectors extracted from the frequency patterns of voice data uttered by a large number of speakers, and voice recognition is performed by pattern matching between this acoustic model and the feature vectors of the input voice. Thus, when an acoustic model is trained using, as training data, utterances of many speakers recorded with different microphones having different frequency characteristics, the feature vectors of these data are statistically reflected in the acoustic model, so that an acoustic model that is robust to differences in microphones and speakers can be constructed. However, the dispersion of the feature vectors represented by the acoustic model may increase, and the identification performance may be degraded.
In contrast, there is a method called cepstral mean normalization (CMN) that decreases the dispersion of an acoustic model and improves recognition performance. In training of the acoustic model, this method obtains, for each speaker included in the training data, a mean vector of the feature vectors of that speaker's voice data, subtracts the mean vector from those feature vectors, and uses the result as training data. The mean vector represents the average characteristics of both the frequency characteristic of the microphone used for recording the voice of the speaker and the frequency pattern of the voice of the speaker. Thus, by subtracting the mean vector from the feature vectors of the speaker, it is possible to absorb differences in microphones or speakers to some extent. When an acoustic model for, for example, the vowel “a” is trained using this training data, it is possible to more accurately model a feature vector of the sound of “a” itself while reducing effects due to differences in microphones or speakers, which provides the advantage of improving recognition performance. However, to perform voice recognition using an acoustic model trained with CMN, it is also necessary, at recognition time, to obtain a mean vector of the input voice in some way and subtract it from the feature vectors of the input voice.
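The per-speaker subtraction described above can be illustrated with a minimal sketch. It assumes that feature vectors are plain lists of floats grouped by speaker. The names `mean_vector`, `cmn_per_speaker`, and the toy data are hypothetical and are not taken from any cited literature.

```python
def mean_vector(frames):
    """Return the element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def cmn_per_speaker(frames_by_speaker):
    """Subtract each speaker's own mean vector from that speaker's feature vectors."""
    normalized = {}
    for speaker, frames in frames_by_speaker.items():
        mu = mean_vector(frames)
        normalized[speaker] = [[x - m for x, m in zip(f, mu)] for f in frames]
    return normalized


# Toy data (assumption): speaker_b's vectors equal speaker_a's plus a constant
# offset, standing in for a different microphone or speaker characteristic.
training = {
    "speaker_a": [[1.0, 2.0], [3.0, 4.0]],
    "speaker_b": [[11.0, 22.0], [13.0, 24.0]],
}
normalized = cmn_per_speaker(training)
```

In this toy case the constant offset between the two speakers disappears after normalization, which corresponds to the absorption of microphone or speaker differences described above.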
Patent Literature 1 discloses a method that, when a hidden Markov model (HMM) is used as an acoustic model, performs CMN by approximately obtaining a mean vector from HMM parameters obtained after the training, instead of performing CMN in the training. It also discloses a technique that combines this method with noise adaptation of an acoustic model to quickly obtain an acoustic model that is robust to both multiplicative distortion due to differences in frequency characteristics of microphones or other factors and additive distortion due to ambient noise or other factors. As methods of calculating a mean vector of input voice, Patent Literature 1 discloses calculating, for each utterance of the input voice, a mean vector from the entire utterance, or calculating a mean vector from the feature vectors up to the preceding utterance during voice recognition.
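The two ways of obtaining a mean vector of input voice mentioned above can be contrasted in a rough sketch. This is not the implementation of Patent Literature 1. The names `cmn_whole_utterance` and `PrecedingUtteranceCMN` are hypothetical, and the use of a zero mean vector before any utterance has been observed is an assumption made only for this illustration.

```python
def cmn_whole_utterance(frames):
    """Subtract the mean computed over the entire current utterance.

    The whole utterance must be available before any frame can be normalized.
    """
    dim = len(frames[0])
    mu = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    return [[x - m for x, m in zip(f, mu)] for f in frames]


class PrecedingUtteranceCMN:
    """Subtract the mean of all frames observed up to the preceding utterance."""

    def __init__(self, dim):
        self.total = [0.0] * dim  # running sum of all past feature vectors
        self.count = 0            # number of past feature vectors

    def normalize(self, frames):
        if self.count:
            mu = [t / self.count for t in self.total]
        else:
            # Assumption: with no history yet, fall back to a zero mean vector.
            mu = [0.0] * len(self.total)
        out = [[x - m for x, m in zip(f, mu)] for f in frames]
        # Only after normalizing, fold the current utterance into the history.
        for f in frames:
            for d, x in enumerate(f):
                self.total[d] += x
        self.count += len(frames)
        return out


utt1 = [[1.0, 2.0], [3.0, 4.0]]
utt2 = [[5.0, 6.0]]

whole = cmn_whole_utterance(utt1)
running = PrecedingUtteranceCMN(dim=2)
first = running.normalize(utt1)   # no history, so the frames are unchanged
second = running.normalize(utt2)  # uses the mean of utt1 only
```

The whole-utterance variant cannot start normalizing until the utterance has ended, whereas the preceding-utterance variant can normalize frames as they arrive but depends on earlier utterances being representative of the current one.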