The present invention relates to pattern recognition. In particular, the present invention relates to speech recognition.
A pattern recognition system, such as a speech recognition system, takes an input signal and attempts to decode the signal to find a pattern represented by the signal. For example, in a speech recognition system, a speech signal (often referred to as a test signal) is received by the recognition system and is decoded to identify a string of words represented by the speech signal.
Many speech recognition systems utilize Hidden Markov Models (HMMs) in which phonetic units, which are also referred to as acoustic units or speech units, are represented by a single tier of connected states. Using a training signal, probability distributions for occupying the states and for transitioning between states are determined for each of the phonetic units. To decode a speech signal, the signal is divided into frames and each frame is transformed into a feature vector. The feature vectors are then compared to the distributions for the states to identify the most likely sequence of HMM states that can be represented by the frames. The phonetic unit that corresponds to that sequence is then selected.
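As a rough illustration of the decoding step just described, the following Python sketch finds the most likely state sequence for a short series of frames. It uses the Viterbi algorithm, a standard dynamic-programming technique that this background does not itself name. It also assumes scalar features and one Gaussian distribution per state. The function names, the two-state model, and all numeric values are hypothetical and chosen only for illustration.

```python
import math

NEG_INF = float("-inf")

def safe_log(p):
    """Log of a probability, mapping zero to negative infinity."""
    return math.log(p) if p > 0 else NEG_INF

def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian (assumed form of a state's distribution)."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def viterbi(frames, init, trans, means, variances):
    """Return the most likely state index for each frame."""
    n = len(means)
    # score[s]: best log probability of any state path ending in state s
    score = [safe_log(init[s]) + log_gauss(frames[0], means[s], variances[s])
             for s in range(n)]
    backptrs = []
    for x in frames[1:]:
        new_score, ptr = [], []
        for s in range(n):
            # best predecessor of state s for this frame
            prev = max(range(n), key=lambda p: score[p] + safe_log(trans[p][s]))
            ptr.append(prev)
            new_score.append(score[prev] + safe_log(trans[prev][s])
                             + log_gauss(x, means[s], variances[s]))
        score = new_score
        backptrs.append(ptr)
    # trace the best path backwards from the best final state
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path

# Hypothetical two-state, left-to-right model
frames = [0.1, -0.2, 0.3, 4.8, 5.2]
path = viterbi(frames,
               init=[1.0, 0.0],
               trans=[[0.8, 0.2], [0.0, 1.0]],
               means=[0.0, 5.0],
               variances=[1.0, 1.0])
print(path)
```

In this toy case the first three frames lie near the mean of state 0 and the last two near the mean of state 1, so the decoded path switches states between the third and fourth frames. A practical recognizer would use multidimensional feature vectors and far larger state spaces.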
Although HMM-based recognition systems perform well in many relatively simple speech recognition tasks, they do not directly model some important dynamic aspects of speech, and they are known to perform poorly on difficult tasks such as conversational speech. As a result, they are not able to accommodate dynamic articulation differences between the speech signals used for training and the speech signal being decoded.
For example, in casual speaking settings, speakers tend to hypo-articulate, or under-articulate, their speech. This means that the recursively defined trajectory of the user's speech articulation may not reach its intended target before it is redirected to a next target. Because the training signals are typically formed using a “reading” style of speech in which the speaker provides more fully articulated speech material than in hypo-articulated speech, the hypo-articulated speech does not match the trained HMM states. As a result, the recognizer provides less than ideal recognition results for casual speech.
A similar problem occurs with hyper-articulated speech. In hyper-articulated speech, the speaker exerts an extra effort to make the different sounds of their speech distinguishable. This extra effort can include changing the sounds of certain phonetic units so that they are more distinguishable from similar sounding phonetic units, holding the sounds of certain phonetic units longer, or transitioning between sounds more abruptly so that each sound is perceived as being distinct from its neighbors. Each of these mechanisms makes it more difficult to recognize the speech using an HMM system because each technique results in a set of feature vectors for the speech signal that often do not match well to the feature vectors present in the training data.
Even if feature vectors corresponding to hyper- or hypo-articulated speech are included in the training data (which may be very expensive to obtain), the conventional HMM technique will still perform poorly. Such speech increases phonetic confusability, and an HMM system does not take into account the underlying causes of the changes in the feature vector trajectories induced by hyper- or hypo-articulation. This problem is addressed specifically by the present invention.
HMM systems also have trouble dealing with changes in the rate at which people speak. Thus, if someone speaks more slowly or more quickly than the speaker of the training signal, the HMM system will tend to make more errors decoding the speech signal.
Alternatives to HMM systems have been proposed. In particular, it has been proposed that the statistically defined trajectory or behavior of a production-related parameter of the speech signal should be modeled directly. Since the production-related values cannot be measured directly, these models are known as Hidden Dynamic Models (HDMs). Hidden Dynamic Models are one example of a class of models known as switching state space models, which model the value of a parameter for a current frame based on the value of the parameter in one or more previous frames and one or more constants selected for the frame.
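To make the idea of a switching state space model more concrete, the following Python sketch simulates one simple form such a model might take. It assumes a first-order, noise-free recursion in which the hidden value for each frame is a weighted mix of the previous frame's value and a target constant selected by the frame's speech unit. The recursion, the function name `simulate_trajectory`, the unit labels, and the parameter values are all illustrative assumptions rather than the model defined by any particular system.

```python
def simulate_trajectory(unit_sequence, params, x0=0.0):
    """Simulate a hidden production-related value, one value per frame.

    unit_sequence: speech-unit label for each frame.
    params: maps each label to (rate, target); these are the constants
            "selected for the frame" by the active speech unit.
    """
    x = x0
    trajectory = []
    for unit in unit_sequence:
        rate, target = params[unit]
        # Assumed first-order recursion: move part of the way from the
        # previous value toward the current unit's target.
        x = rate * x + (1.0 - rate) * target
        trajectory.append(x)
    return trajectory

# Hypothetical units "a" and "b" with opposite targets
params = {"a": (0.8, 1.0), "b": (0.8, -1.0)}

# Long segments: the value has time to approach each target
careful = simulate_trajectory(["a"] * 20 + ["b"] * 20, params)

# Short segments: the value is redirected before reaching the target
casual = simulate_trajectory(["a"] * 4 + ["b"] * 4, params)

print(round(careful[19], 3), round(casual[3], 3))
```

With these assumed values, the trajectory for the long segment ends close to the target of 1.0, while the trajectory for the short segment reaches only about 0.59 before being redirected toward the next target. This is the kind of undershoot associated with hypo-articulated speech.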
One problem with HDMs is that it is difficult to train them because common training algorithms, such as the Expectation-Maximization algorithm, become intractable for HDMs. This is due largely to the fact that in order to obtain the posterior probability for a sequence of hidden parameters given a sequence of input values, the probability for the combination of a hidden parameter and a possible speech unit must be summed over all possible sequences of speech units. This leads to a computation that increases exponentially with each additional frame of input values.
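The growth described above can be made concrete with a small counting sketch. If each frame may belong to any one of a fixed number of speech units, the number of distinct unit sequences is that number raised to the power of the frame count. The function names below and the example figures (3 units, 40 units, 100 frames) are hypothetical. The sketch counts and enumerates sequences only, and does not compute any posterior probability.

```python
from itertools import product

def count_unit_sequences(n_units, n_frames):
    """Number of distinct speech-unit sequences over n_frames frames."""
    return n_units ** n_frames

def enumerate_unit_sequences(units, n_frames):
    """Explicitly list every unit sequence; feasible only for tiny inputs."""
    return list(product(units, repeat=n_frames))

# Tiny case: 3 units over 4 frames can still be enumerated
small = enumerate_unit_sequences(["a", "b", "c"], 4)
print(len(small))

# Each additional frame multiplies the count by the number of units
for n_frames in (1, 2, 5, 10):
    print(n_frames, count_unit_sequences(40, n_frames))

# Number of decimal digits in the count for 40 units and 100 frames
print(len(str(count_unit_sequences(40, 100))))
```

Even at these modest sizes, a sum taken explicitly over every sequence is out of reach, which is why exact training becomes intractable without some approximation or restriction.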
To overcome this problem, some systems of the prior art have assumed a fixed sequence of speech units during training. The boundaries between the speech units that define this sequence are set using HMM training before training the HDM. This is not theoretically optimal because the boundary parameters of the speech units are being fixed based on a different criterion than the other parameters in the hidden dynamic model.
Thus, a training system is needed that allows the boundaries to be trained with the other parameters of a hidden dynamic model while overcoming the intractability associated with such training.