The invention relates to speech coding, such as for a speech recognition system.
The first step of speech recognition involves measuring the utterance to be recognized. A speech coding apparatus may measure, for example, the amplitude of the utterance to be recognized in one or more frequency bands during each of a series of time intervals (for example, ten-millisecond time intervals). Each measurement by the speech coding apparatus may be filtered, normalized, or otherwise manipulated to obtain desired speech information, with the result being stored as an acoustic feature vector.
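The measurement step described above can be illustrated with a minimal sketch. It assumes a naive discrete Fourier transform over each ten-millisecond frame and equal-width frequency bands. The function names, the band count, and the sample rate are hypothetical choices made for illustration, and no filtering or normalization is applied.

```python
import cmath
import math

def band_amplitudes(frame, num_bands):
    """Sum DFT magnitudes of one frame into equal-width frequency bands."""
    n = len(frame)
    half = n // 2  # bins below the Nyquist frequency
    mags = []
    for k in range(half):
        s = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(s) / n)
    width = half // num_bands  # leftover bins (if any) are ignored in this sketch
    return [sum(mags[b * width:(b + 1) * width]) for b in range(num_bands)]

def feature_vectors(samples, sample_rate, frame_ms=10, num_bands=4):
    """Produce one feature vector per non-overlapping frame of frame_ms milliseconds."""
    frame_len = sample_rate * frame_ms // 1000
    return [band_amplitudes(samples[i:i + frame_len], num_bands)
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

# A 1500 Hz tone sampled at 8000 Hz for 30 ms gives three ten-millisecond frames.
sample_rate = 8000
samples = [math.sin(2 * math.pi * 1500 * t / sample_rate) for t in range(240)]
vectors = feature_vectors(samples, sample_rate)
```

With these assumed settings each frame holds 80 samples, each band spans 1000 Hz, and the energy of the tone falls in the second band of every feature vector.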
In a speech recognition apparatus, the acoustic feature vectors produced by the speech coder from an utterance to be recognized are compared to acoustic models of words to find the best-matched models. In order to simplify the comparison, the acoustic feature vectors may be converted from continuous variables to discrete variables by vector quantization. The discrete variables may then be compared to the acoustic models.
The acoustic feature vectors may be quantized by providing a finite set of prototype vectors. Each prototype vector has an identification (a label), and has one or more sets of parameter values. The value of an acoustic feature vector is compared to the parameter values of the prototype vectors to find the closest prototype vector. The identification (label) of the closest prototype vector is output as a coded representation of the acoustic feature vector.
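The labeling step just described can be sketched as a nearest-prototype search. The sketch assumes that each prototype has a single set of parameter values and that closeness is measured by Euclidean distance. The labels "P1" to "P3" and the function name quantize are hypothetical.

```python
import math

def quantize(feature, prototypes):
    """Return the label of the prototype closest to the feature vector."""
    best_label, best_dist = None, math.inf
    for label, params in prototypes.items():
        dist = math.dist(feature, params)  # Euclidean distance (an assumption)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Hypothetical prototype set: label -> parameter values.
prototypes = {"P1": [0.0, 0.0], "P2": [1.0, 1.0], "P3": [4.0, 0.0]}

# A sequence of feature vectors is coded as a sequence of labels.
labels = [quantize(v, prototypes) for v in [[0.1, 0.2], [0.9, 1.2], [3.5, 0.1]]]
```

Here the three feature vectors are coded as the label sequence P1, P2, P3.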
Each prototype value may be obtained, for example, by averaging the values of a set of acoustic feature vectors corresponding to the prototype vector. Acoustic feature vectors may be correlated with prototype vectors, for example, by coding an utterance of a known training script by using an initial set of prototype vectors, and then finding the most probable alignment between the acoustic feature vectors and an acoustic model of the training script.
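As a rough illustration of the averaging step, the sketch below assumes that an alignment has already assigned a prototype label to every training feature vector. The alignment against the acoustic model of the training script is not shown, and the function name estimate_prototypes is hypothetical.

```python
from collections import defaultdict

def estimate_prototypes(features, labels):
    """Average, per label, the feature vectors aligned with that label."""
    groups = defaultdict(list)
    for vec, label in zip(features, labels):
        groups[label].append(vec)
    # Component-wise mean of each group of vectors.
    return {label: [sum(col) / len(vecs) for col in zip(*vecs)]
            for label, vecs in groups.items()}

# Hypothetical aligned training data.
features = [[0.0, 2.0], [2.0, 4.0], [10.0, 10.0]]
aligned_labels = ["A", "A", "B"]
new_prototypes = estimate_prototypes(features, aligned_labels)
```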
It has been found, however, that a single average for each prototype vector does not accurately model the prototype vector. A better model is obtained if each prototype vector consists of a mixture of partitions obtained by dividing the set of acoustic feature vectors corresponding to the prototype vector into a number of clusters.
The set of acoustic feature vectors corresponding to a prototype vector may, for example, be grouped according to the context (for example, the preceding or following sounds) of each acoustic feature vector in the training script. Each context group may be divided into clusters of acoustic feature vectors arranged close to each other (for example, by K-means clustering), in order to adequately model each prototype vector. (See Clustering Algorithms, John A. Hartigan, John Wiley & Sons, Inc., 1975.) Each cluster of acoustic feature vectors forms a partition. Each partition may be represented by values such as the average of the acoustic feature vectors forming the partition and the covariance matrix of those vectors. (For simplicity, all off-diagonal terms of the covariance matrix may be approximated by zero.)
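The clustering of one context group into partitions might be sketched as follows. This is a plain K-means loop with a deterministic initialization taken from the first k vectors, which is an assumption made for reproducibility. Each resulting partition is summarized by its mean and by the diagonal of its covariance matrix. All function names are hypothetical.

```python
import math

def mean(vectors):
    """Component-wise average of a list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def diagonal_variance(vectors, center):
    """Diagonal covariance terms; off-diagonal terms are treated as zero."""
    return [sum((x - c) ** 2 for x in col) / len(vectors)
            for col, c in zip(zip(*vectors), center)]

def k_means(vectors, k, iterations=20):
    """Return the clusters from the final assignment pass of K-means."""
    centers = [list(v) for v in vectors[:k]]  # assumed initialization
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda i: math.dist(v, centers[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous center.
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return clusters

def build_partitions(vectors, k):
    """Summarize each non-empty cluster as a partition."""
    partitions = []
    for cluster in k_means(vectors, k):
        if cluster:
            m = mean(cluster)
            partitions.append({"mean": m,
                               "var": diagonal_variance(cluster, m),
                               "count": len(cluster)})
    return partitions

# Hypothetical context group with two well-separated clusters.
group = [[0.0, 0.0], [10.0, 10.0], [0.0, 2.0], [10.0, 12.0]]
parts = build_partitions(group, 2)
```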
In order to adequately model each prototype vector in the manner described above, substantial amounts of training data from utterances of training scripts are needed, and substantial computing resources are needed to analyze the training data. Moreover, there is no correlation between clusters of acoustic feature vectors from one speaker to another, so prototype vector data from one speaker cannot be used to assist in generating prototype vectors for another speaker.
Further, in order to compare the value of an acoustic feature vector to the parameter values of a prototype vector, the value of the acoustic feature vector must be matched to the parameter values of all partitions making up the prototype vector to produce a combined match score. It has been found, however, that the match score for the partition closest to the acoustic feature vector typically dominates the combined match score for all partitions. Therefore, the prototype match score can be approximated by the match score for the one partition of the prototype which is closest to the acoustic feature vector.
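The approximation can be illustrated numerically. The sketch below assumes that each partition is scored by a weighted Gaussian density with diagonal covariance and that the combined match score is the sum over partitions. The text does not fix the form of the match score, so this is only one plausible choice. The partition weights, the example values, and the function names are hypothetical, and the closest partition is taken to be the one with the highest individual score.

```python
import math

def gaussian_density(x, mean, var):
    """Diagonal-covariance Gaussian density of x (assumed scoring function)."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)

def partition_scores(x, partitions):
    """Weighted match score of x against each partition."""
    return [p["weight"] * gaussian_density(x, p["mean"], p["var"])
            for p in partitions]

def combined_score(x, partitions):
    """Combined match score: sum over all partitions of the prototype."""
    return sum(partition_scores(x, partitions))

def closest_partition_score(x, partitions):
    """Approximation: score of the single best-matching partition."""
    return max(partition_scores(x, partitions))

# Hypothetical prototype made up of two partitions.
partitions = [
    {"weight": 0.5, "mean": [0.0, 0.0], "var": [1.0, 1.0]},
    {"weight": 0.5, "mean": [6.0, 6.0], "var": [1.0, 1.0]},
]
x = [0.5, -0.2]
full = combined_score(x, partitions)
approx = closest_partition_score(x, partitions)
ratio = approx / full
```

For a feature vector lying near one partition and far from the other, the ratio is very close to one, so little is lost by scoring only the closest partition.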