1. Field of the Invention
The present invention relates to a speech recognition system, and more particularly to a speaker adaptive speech recognition system that is robust to noise. Throughout this specification, "voice recognition" means speech recognition.
2. Description of the Related Art
In the related art, a system shown in FIG. 9 is well known as a speaker adaptive voice recognition system, for example.
This voice recognition system is provided with a previously prepared standard acoustic model 100 of an unspecified speaker. A speaker adaptive acoustic model 200 is prepared by using the standard acoustic model 100 and a feature vector of an input signal Sc generated from an input voice uttered by a specified speaker, and voice recognition is conducted with the system adapted to the voice of the specified speaker.
When the adaptive acoustic model 200 is prepared, a standard vector Va corresponding to a designated text (sentence or syllable) Tx is supplied from the standard acoustic model 100 to a path search section 4 and a speaker adaptation section 5. In addition, the specified speaker actually utters the designated text Tx, and the resulting input signal Sc is inputted.
Then, after an additive noise reduction section 1 removes additive noise included in the input signal Sc, a feature vector generation section 2 generates a feature vector series Vcf which represents the feature quantity of the input signal Sc. Further, a multiplicative noise reduction section 3 removes multiplicative noise from the feature vector series Vcf, and generates a feature vector series Vc from which both the additive noise and the multiplicative noise have been removed. The feature vector series Vc is supplied to the path search section 4 and the speaker adaptation section 5.
In this manner, when the standard vector Va and the feature vector series Vc of the actually uttered input signal Sc are supplied to the path search section 4 and the speaker adaptation section 5, the path search section 4 compares the feature vector series Vc with the standard vector Va. The appearance probability of the feature vector series Vc for each syllable, and the state transition probability from one syllable to another syllable, are then found. Thereafter, the speaker adaptation section 5 compensates the standard vector Va according to the appearance probability and the state transition probability, whereby the speaker adaptive acoustic model 200, adapted to the features of the voice (input signal) proper to the specified speaker, is prepared.
In this way, the speaker adaptive acoustic model 200 is adapted to the input signal generated from the voice uttered by the specified speaker. Thereafter, when the specified speaker utters an arbitrary voice, the feature vector of the input signal generated from the uttered voice is collated with the adaptive vector of the speaker adaptive acoustic model 200, and the voice recognition is conducted in such a manner that the speaker adaptive acoustic model 200 which gives the highest likelihood is taken as the recognition result.
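By way of illustration only, the collation step described above may be sketched as follows. The sketch is not taken from the related art system itself: the function names log_likelihood and recognize are hypothetical, each candidate is assumed to be represented by one adaptive vector per frame, the frames are assumed to be already aligned one-to-one with those vectors, and a unit-variance Gaussian score is assumed as the likelihood.

```python
def log_likelihood(features, adaptive_vectors):
    """Score a feature vector series against one candidate's adaptive vectors.

    Assumption: frames are already aligned one-to-one with the adaptive
    vectors, and each frame is scored with a unit-variance Gaussian, so the
    log-likelihood reduces (up to a constant) to -0.5 * squared distance.
    """
    total = 0.0
    for frame, mean in zip(features, adaptive_vectors):
        total += -0.5 * sum((x - m) ** 2 for x, m in zip(frame, mean))
    return total


def recognize(features, candidates):
    """Return the label of the candidate that gives the highest likelihood.

    `candidates` maps a label (hypothetical, e.g. a syllable or word) to the
    adaptive vectors held for that label.
    """
    return max(candidates, key=lambda label: log_likelihood(features, candidates[label]))
```

For example, given candidates {"a": [[0.0, 0.0], [1.0, 1.0]], "b": [[5.0, 5.0], [6.0, 6.0]]}, a feature vector series lying close to the vectors of "a" is scored higher for "a" than for "b", and "a" is returned.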
In this connection, in the above conventional adaptation type voice recognition system, when the adaptive acoustic model 200 is prepared, the additive noise reduction section 1 removes the additive noise by the spectrum subtraction method, and the multiplicative noise reduction section 3 removes the multiplicative noise by the CMN (cepstrum mean normalization) method, so that a speaker adaptive acoustic model 200 not influenced by the noise is prepared.
That is, the additive noise reduction section 1 removes the spectrum of the additive noise from the spectrum of the input signal Sc after the spectrum of the input signal Sc is found. The multiplicative noise reduction section 3 subtracts the time average value from the cepstrum of the input signal Sc after the time average value of the cepstrum of the input signal Sc is found.
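By way of illustration only, the two noise reduction operations described above may be sketched as follows. The function names, the use of plain lists for one spectrum frame and for the cepstrum series, the separately supplied noise spectrum estimate, and the flooring of negative values at zero are all assumptions made for the sketch and are not taken from the related art system.

```python
def spectral_subtraction(spectrum, noise_estimate, floor=0.0):
    """Subtract an estimated additive-noise spectrum from one frame's spectrum.

    `noise_estimate` is assumed to be obtained elsewhere (for example from a
    non-speech interval). Each bin is clamped at `floor` so that it cannot
    become negative; the floor is an assumption of this sketch.
    """
    return [max(s - n, floor) for s, n in zip(spectrum, noise_estimate)]


def cepstral_mean_normalization(cepstra):
    """Subtract the time-average cepstrum from every frame of the series."""
    if not cepstra:
        return []
    num_frames = len(cepstra)
    dim = len(cepstra[0])
    # Time average of each cepstral coefficient over all frames.
    mean = [sum(frame[d] for frame in cepstra) / num_frames for d in range(dim)]
    return [[frame[d] - mean[d] for d in range(dim)] for frame in cepstra]
```

In this sketch, the time average subtracted by cepstral_mean_normalization is computed from the whole input, so any component of the speaker's own voice that is constant over time is removed together with the multiplicative noise.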
However, with either the spectrum subtraction method or the CMN method, it is very difficult to remove only the noise. Because the feature information of the speaker's utterance that should properly be compensated for by the speaker adaptation may also be lost, an adequate speaker adaptive acoustic model 200 cannot be prepared. As a result, there is a problem that the voice recognition rate is degraded.