A. Field of the Invention
The present invention relates generally to speech recognition and, more particularly, to systems and methods that model speech using a small number of one-dimensional Gaussian distributions.
B. Description of Related Art
Conventional speech recognizers identify unknown spoken utterances. Through a process known as training, the recognizer examines known words and samples and records features of the words as recognition models. The recognition models represent typical acoustic renditions of known words. In the training process, the recognizer applies a training algorithm to the recognition models to form the stored representations that it uses to identify future unknown words.
Conventional speech recognizers typically perform speech recognition in four stages. In the first stage, the recognizer receives unknown speech signals from a source, such as a microphone or network. In the second stage, the recognizer determines features that are based on a short-term spectral analysis of the unknown speech signal at predetermined intervals, such as 10 ms. These features, commonly referred to as “feature vectors,” are usually the output of some type of spectral analysis technique, such as a filter bank analysis, a linear predictive coding analysis, or a Fourier transform analysis.
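As a rough illustration of the second stage, the following Python sketch splits a sampled signal into frames at 10 ms intervals and computes a magnitude spectrum for each frame with a naive discrete Fourier transform. The function names, the 25 ms window length, the 8 kHz sample rate, and the test tone are assumptions chosen for this example; they are not taken from any particular recognizer.

```python
import cmath
import math


def frame_signal(samples, sample_rate, hop_ms=10, window_ms=25):
    """Split samples into overlapping frames, one frame every hop_ms."""
    hop = int(sample_rate * hop_ms / 1000)
    window = int(sample_rate * window_ms / 1000)
    frames = []
    for start in range(0, len(samples) - window + 1, hop):
        frames.append(samples[start:start + window])
    return frames


def magnitude_spectrum(frame, num_bins):
    """Naive DFT magnitudes for the first num_bins frequency bins."""
    n = len(frame)
    spectrum = []
    for k in range(num_bins):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc))
    return spectrum


# Hypothetical input: 100 ms of a 400 Hz tone sampled at 8 kHz.
SAMPLE_RATE = 8000
signal = [math.sin(2 * math.pi * 400 * t / SAMPLE_RATE) for t in range(800)]

frames = frame_signal(signal, SAMPLE_RATE)
# One feature vector per frame; here simply the first 20 spectral magnitudes.
feature_vectors = [magnitude_spectrum(f, num_bins=20) for f in frames]
```

A practical front end would typically use a fast Fourier transform and a filter bank rather than raw DFT magnitudes; the naive transform is used here only to keep the sketch self-contained.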
In the third stage, the recognizer compares the feature vectors with one or more of the recognition models that have been stored during the training process. During this comparison, the recognizer determines the degree of similarity between the feature vectors and the recognition models. In the final stage, the recognizer determines, based on the recognition model similarity scores, which recognition model best matches the unknown speech signal. The recognizer may then output the word(s) corresponding to the recognition model with the highest similarity score.
Many of today's speech recognizers are based on the hidden Markov model (HMM). The HMM provides a pattern matching approach to speech recognition. Conventional recognition systems commonly use two types of HMMs: discrete density HMMs and continuous density HMMs.
For discrete density HMMs, a conventional speech recognition system divides the feature space into a predetermined number of disjoint regions. Typically, the system computes one feature vector for every 10 ms of speech. The system determines, for each feature vector in the input speech, in which region the feature vector lies. This usually does not require very much computation because the system performs this operation only once in each frame. Each probability distribution in the HMM then models the probability mass within each region. Thus, to obtain the probability of the input feature vector for a particular distribution, the speech recognition system need only look up the probability for the index of the region for the feature vector.
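The lookup described above can be sketched as follows. This is a minimal sketch that assumes the disjoint regions are defined by a nearest-codeword rule; the codebook, the state names, and the probability tables are hypothetical values invented for illustration.

```python
def nearest_region(feature_vector, codebook):
    """Return the index of the codeword closest to the feature vector."""
    best_index = 0
    best_dist = float("inf")
    for index, codeword in enumerate(codebook):
        dist = sum((x - c) ** 2 for x, c in zip(feature_vector, codeword))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


# Hypothetical 2-D codebook with three regions (one codeword per region).
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]

# Hypothetical probability mass per region for two HMM distributions.
state_tables = {
    "state_a": [0.7, 0.2, 0.1],
    "state_b": [0.1, 0.3, 0.6],
}

frame = (3.5, 0.4)
region = nearest_region(frame, codebook)  # computed once per frame
# Each distribution then needs only a table lookup, not a distance computation.
state_probs = {name: table[region] for name, table in state_tables.items()}
```

The region search runs once per frame regardless of how many distributions are scored, which is why the per-distribution cost reduces to an index lookup.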
Continuous density HMMs model each distribution using a parametric function, such as a mixture of Gaussian distributions. That is, each distribution has its own set of multivariate Gaussian distributions that together form a probability density function. In this case, when a conventional speech recognition system compares an input feature vector with a probability distribution for a state, the system computes the weighted Euclidean distance from the input feature vector to each Gaussian distribution in the mixture distribution to determine the probability of the Gaussian distribution. This calculation may be represented by the following equation:
$$\mathrm{Dist} = \sum_{i=1}^{D} \frac{\bigl(x(i) - u(i)\bigr)^{2}}{\sigma^{2}} \qquad \text{(Eq. 1)}$$
where x represents the input vector, u represents the mean of the Gaussian distribution, and σ represents the standard deviation of the Gaussian distribution (i.e., σ² represents variance). The system computes the distance for each dimension of the input vector. A typical input vector may have 45 dimensions. As a result, the distance computation often dominates the computation needed for speech recognition.
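To make the cost of Eq. 1 concrete, the following sketch evaluates the distance for one Gaussian and then estimates how many per-dimension terms a recognizer would compute. The function name and the sample vectors are illustrative, and the figure of 1,000 Gaussians per frame is a hypothetical assumption; only the 45 dimensions and the 10 ms frame interval come from the text.

```python
def weighted_distance(x, u, variance):
    """Eq. 1: sum over dimensions of (x(i) - u(i))^2 / sigma^2."""
    return sum((xi - ui) ** 2 / variance for xi, ui in zip(x, u))


# Small worked example with three dimensions.
x = [1.0, 2.0, 3.0]
u = [0.0, 2.0, 5.0]
dist = weighted_distance(x, u, variance=2.0)  # (1 + 0 + 4) / 2

# Rough cost estimate under stated assumptions.
NUM_DIMENSIONS = 45        # from the text
FRAMES_PER_SECOND = 100    # one frame every 10 ms
NUM_GAUSSIANS = 1000       # hypothetical number of Gaussians scored per frame

terms_per_frame = NUM_DIMENSIONS * NUM_GAUSSIANS
terms_per_second = terms_per_frame * FRAMES_PER_SECOND
```

Under these assumptions the recognizer evaluates 45,000 squared-difference terms per frame, or 4.5 million per second of speech, before any search over word hypotheses begins.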
Continuous density HMMs generally provide more accurate recognition than discrete density HMMs, making them more desirable. Many conventional speech recognition systems share distributions among multiple states to decrease the amount of training data needed and to decrease the amount of computation needed during recognition. Many other conventional systems share sets of Gaussian distributions among several distributions, but permit the distributions to have different mixture weights. The distance computation, however, still dominates the computation time in both of these systems. Generally, the more Gaussian distributions an HMM has, the more accurate the speech recognition is, as long as there is enough training data available. In practice, training data is always limited.
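The sharing of Gaussian distributions with distribution-specific mixture weights can be sketched as follows. This is a simplified illustration, not a description of any particular system: it scores each Gaussian as exp(-Dist/2) using the distance of Eq. 1 and omits the Gaussian normalization constant, and all means, variances, weights, and state names are hypothetical.

```python
import math


def gaussian_scores(x, gaussians):
    """Unnormalized score exp(-Dist / 2) for each shared Gaussian."""
    scores = []
    for mean, variance in gaussians:
        dist = sum((xi - ui) ** 2 / variance for xi, ui in zip(x, mean))
        scores.append(math.exp(-0.5 * dist))
    return scores


def mixture_score(weights, shared_scores):
    """Weighted sum of the shared Gaussian scores for one distribution."""
    return sum(w * s for w, s in zip(weights, shared_scores))


# Hypothetical shared set of two Gaussians: (mean vector, variance).
shared_gaussians = [((0.0, 0.0), 1.0), ((2.0, 2.0), 1.0)]

# Each distribution keeps its own mixture weights over the shared set.
state_weights = {"state_a": [0.9, 0.1], "state_b": [0.2, 0.8]}

frame = (2.0, 2.0)
shared = gaussian_scores(frame, shared_gaussians)  # distances computed once
state_scores = {name: mixture_score(w, shared) for name, w in state_weights.items()}
```

Sharing lets every distribution reuse the same distance results, yet one distance must still be computed for every Gaussian in the shared set on every frame, which is consistent with the observation that the distance computation continues to dominate.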
As a result, a need exists for a system and method that reduce both the amount of computation needed for speech recognition and the amount of training data needed to model the Gaussian distributions.