The present invention relates to a speech recognition apparatus of the speaker adaptation type, which can be adapted to the speech of different speakers and to speech uttered in different circumstances.
There are conventional methods of recognizing input speech by using a prestored reference pattern, which is a speech pattern of a standard voice. Such methods include the DP (dynamic programming) matching method and the Hidden Markov Model (HMM) method described in the published paper "Structural Methods in Automatic Speech Recognition", Proceedings of the IEEE, Vol. 73, No. 11, page 1625, November 1985 (hereinafter referred to as "reference 1").
In a speech recognition apparatus utilizing these methods, when recognizing the speech of a particular person different from the speaker of the reference pattern (hereinafter referred to as the "particular speaker"), the reference pattern has to be adapted to the voice of the particular speaker in order to obtain high recognition performance, because the speech pattern varies from speaker to speaker. Further, when recognizing speech uttered in circumstances where the magnitude of background noise differs, or where the voice may or may not be transmitted through a telephone line, the speech pattern is deformed to such an extent that modification of the reference pattern is necessary. Hereinafter, the invention will be described mainly in connection with the case of recognizing the speech of a particular speaker, but the invention can also be applied to the recognition of speech uttered in different circumstances.
Conventionally, in order to adapt the speech recognition apparatus to the particular speaker, the particular speaker is required to utter all of the words to be recognized in advance, and the reference pattern is replaced with the speech pattern of the particular speaker. However, when the vocabulary of words to be recognized is large, uttering all of the words to be recognized is laborious work for the particular speaker.
In order to eliminate such laborious work, a method has already been proposed for adapting the reference pattern to the particular speaker according to a small amount of training speech uttered in advance by the particular speaker. For example, speaker adaptation based on vector quantization is described in the published paper "Speaker Adaptation Through Vector Quantization", IEEE ICASSP 86, 49.5, page 2643, 1986 (hereinafter referred to as "reference 2").
According to this method, a codebook for vector quantization is first generated from the speech pattern of a reference speaker having a standard voice (hereinafter, this codebook is called the "reference codebook"), and the speech pattern is then vector-quantized by using the reference codebook to generate the reference pattern. When the speech of a particular speaker is to be recognized, the reference codebook is normalized to generate a normalized codebook by using a speech pattern of sample words selected from the vocabulary to be recognized, which has been uttered in advance by the particular speaker (hereinafter, this speech pattern is called the "training pattern"). The speech of the particular speaker can then be recognized by using this normalized codebook in place of the reference codebook, without modifying the vector-quantized reference pattern. Namely, even if the vocabulary to be recognized is large, the speech recognition apparatus can be adapted to the utterances of the particular speaker by having the particular speaker utter only part of the vocabulary in advance, rather than all of the words belonging to the vocabulary.
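The role of the codebook in this scheme can be illustrated by the following minimal sketch. It is not taken from reference 2; the function names, the squared Euclidean distance, and the two-dimensional toy vectors are all assumptions chosen for illustration. The sketch shows that the reference pattern is stored as a sequence of code indices, so that substituting a normalized codebook changes the vectors the pattern stands for while the stored indices remain untouched.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance (an assumed distance measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Replace each frame by the index of its nearest code vector."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
        for f in frames
    ]


def decode(indices: Sequence[int], codebook: Sequence[Vector]) -> List[List[float]]:
    """Look the stored indices up in a (possibly different) codebook."""
    return [list(codebook[k]) for k in indices]


# Hypothetical toy data: three reference code vectors and four speech frames.
reference_codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
reference_frames = [[0.1, -0.1], [0.9, 1.2], [3.8, 0.1], [1.1, 0.9]]

# The reference pattern is the sequence of code indices.
reference_pattern = quantize(reference_frames, reference_codebook)

# A hypothetical normalized codebook for a particular speaker. The indices in
# reference_pattern are reused as they are; only the lookup table changes.
normalized_codebook = [[0.5, 0.2], [1.5, 1.4], [4.6, 0.3]]
adapted_pattern = decode(reference_pattern, normalized_codebook)
```

Because only the codebook is exchanged, the cost of adaptation in such a scheme depends on the codebook size rather than on the size of the vocabulary.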
Next, the method of making the normalized codebook will be described. First, a training codebook is generated from the training pattern, and the training pattern is then vector-quantized by using the training codebook. Subsequently, the time axes of the reference pattern and the training pattern of the same word are matched by using DP matching. For each code vector in the reference codebook (hereinafter called a "reference code vector"), the frequency with which each code vector in the training codebook (hereinafter called a "training code vector") is made to correspond to that reference code vector is stored in the form of a histogram. The normalized codebook is obtained by weighting the training code vectors made to correspond to each reference code vector according to their frequencies in the histogram and by averaging the weighted training code vectors.
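The normalization steps described above may be sketched as follows. This is an illustrative reading of the description, not the implementation of reference 2. The helper names, the squared Euclidean frame distance, the symmetric DP step pattern, the use of a single word pair, and the fallback for reference code vectors that never appear in the alignment are all assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors (assumed measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dp_match(ref: Sequence[Vector], trn: Sequence[Vector]) -> List[Tuple[int, int]]:
    """Align two frame sequences by DP matching; returns (ref, trn) index pairs."""
    n, m = len(ref), len(trn)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
            cost[i][j] = sq_dist(ref[i - 1], trn[j - 1]) + best_prev
    # Trace the cheapest path back from the end of both sequences.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


def make_normalized_codebook(
    ref_codes: Sequence[int],
    trn_codes: Sequence[int],
    ref_codebook: Sequence[Vector],
    trn_codebook: Sequence[Vector],
) -> List[List[float]]:
    """Histogram-weighted average of the training code vectors per reference code."""
    ref_vecs = [ref_codebook[k] for k in ref_codes]
    trn_vecs = [trn_codebook[k] for k in trn_codes]
    # hist[r][t]: how often training code t was aligned with reference code r.
    hist: Dict[int, Dict[int, int]] = defaultdict(lambda: defaultdict(int))
    for i, j in dp_match(ref_vecs, trn_vecs):
        hist[ref_codes[i]][trn_codes[j]] += 1
    normalized = []
    for r, ref_vec in enumerate(ref_codebook):
        counts = hist.get(r)
        if not counts:
            # Assumption: a reference code never aligned is left unchanged.
            normalized.append(list(ref_vec))
            continue
        total = sum(counts.values())
        normalized.append([
            sum(c * trn_codebook[t][d] for t, c in counts.items()) / total
            for d in range(len(ref_vec))
        ])
    return normalized


# Hypothetical one-dimensional toy data for a single word.
ref_codebook = [[0.0], [10.0]]
trn_codebook = [[1.0], [2.0], [12.0]]
ref_codes = [0, 0, 1, 1]       # vector-quantized reference pattern
trn_codes = [0, 1, 2, 2, 2]    # vector-quantized training pattern
normalized = make_normalized_codebook(ref_codes, trn_codes, ref_codebook, trn_codebook)
```

In this toy case, reference code 0 is aligned once each with training codes 0 and 1, and reference code 1 is aligned three times with training code 2, so the normalized code vectors are the corresponding weighted averages. In practice the histogram would presumably be accumulated over all training words before averaging.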
However, in the conventional method described in reference 2, since vector quantization is utilized, a reduction in the recognition rate due to quantization error cannot be avoided.
In addition, the variation of the speech pattern between different speakers and circumstances depends on the influence of the preceding and succeeding phonemes, that is, on the phonetic environment. Therefore, the conventional conversion method, which processes each individual vector at a given point in time one by one, cannot take the influence of the phonetic environment into account, and a proper conversion cannot be carried out.