The present invention relates to pattern recognition. In particular, the present invention relates to speech recognition.
A pattern recognition system, such as a speech recognition system, takes an input signal and attempts to decode the signal to find a pattern represented by the signal. For example, in a speech recognition system, a speech signal (often referred to as a test signal) is received by the recognition system and is decoded to identify a string of words represented by the speech signal.
Many speech recognition systems utilize Hidden Markov Models (HMMs) in which phonetic units are represented by a single tier of connected states. Using a training signal, probability distributions for occupying the states and for transitioning between states are determined for each of the phonetic units. To decode a speech signal, the signal is divided into frames and each frame is transformed into a feature vector. The feature vectors are then compared to the distributions for the states to identify the most likely sequence of HMM states that could be represented by the frames. The phonetic unit that corresponds to that sequence is then selected.
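By way of illustration only, the decoding step described above can be sketched with a standard Viterbi search. The sketch assumes diagonal-covariance Gaussian state distributions and a toy two-state, left-to-right model with one-dimensional feature vectors. The function names `log_gaussian` and `viterbi` and all numeric values are hypothetical and are not taken from any particular recognizer.

```python
import math

NEG_INF = float("-inf")


def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def viterbi(frames, log_init, log_trans, means, variances):
    """Return the most likely HMM state sequence for a list of feature vectors."""
    n_states = len(log_init)
    # score[s] is the best log probability of any path ending in state s.
    score = [
        log_init[s] + log_gaussian(frames[0], means[s], variances[s])
        for s in range(n_states)
    ]
    backptr = []
    for x in frames[1:]:
        new_score, ptr = [], []
        for s in range(n_states):
            # Pick the predecessor that maximizes the path score into state s.
            prev = max(range(n_states), key=lambda p: score[p] + log_trans[p][s])
            ptr.append(prev)
            new_score.append(
                score[prev] + log_trans[prev][s]
                + log_gaussian(x, means[s], variances[s])
            )
        score = new_score
        backptr.append(ptr)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


# Toy model (assumed values): state 0 is centered at 0.0, state 1 at 5.0.
means = [[0.0], [5.0]]
variances = [[1.0], [1.0]]
log_init = [0.0, NEG_INF]                      # always start in state 0
log_trans = [[math.log(0.5), math.log(0.5)],   # state 0: stay or advance
             [NEG_INF, 0.0]]                   # state 1: stay only
frames = [[0.1], [0.2], [4.8], [5.1]]

best_path = viterbi(frames, log_init, log_trans, means, variances)
print(best_path)  # -> [0, 0, 1, 1]
```

In this sketch each frame is scored against fixed, per-state distributions, so the trajectory of the speech between states is not represented anywhere in the model.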
Although HMM-based recognition systems perform well in many relatively simple speech recognition tasks, they do not model some important dynamic aspects of speech directly (and are known to perform poorly for difficult tasks such as conversational speech). As a result, they are not able to accommodate dynamic articulation differences between the speech signals used for training and the speech signal being decoded. For example, in casual speaking settings, speakers tend to hypo-articulate, or under-articulate, their speech. This means that the trajectory of the speaker's speech articulation may not reach its intended target before it is redirected to a next target. Because the training signals are typically formed using a “reading” style of speech in which the speaker provides more fully articulated speech material than in hypo-articulated speech, the hypo-articulated speech does not match the trained HMM states. As a result, the recognizer provides less than ideal recognition results for casual speech.
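The target undershoot described above can be illustrated with a small numerical sketch. It assumes, purely for illustration, that a one-dimensional articulatory parameter moves a fixed fraction of the remaining distance toward its current target in each frame. This first-order rule, the function name `target_directed_trajectory`, and the values of `rate` and the durations are assumptions and are not drawn from any model discussed here.

```python
def target_directed_trajectory(start, targets, durations, rate):
    """Move toward each target in turn, closing a fixed fraction of the gap per frame.

    This is an assumed first-order rule used only to illustrate undershoot.
    """
    x = start
    trajectory = []
    for target, n_frames in zip(targets, durations):
        for _ in range(n_frames):
            x = x + rate * (target - x)
            trajectory.append(x)
    return trajectory


# Same targets (1.0, then back to 0.0); only the time spent on each target differs.
careful = target_directed_trajectory(0.0, [1.0, 0.0], [20, 20], rate=0.3)
casual = target_directed_trajectory(0.0, [1.0, 0.0], [3, 3], rate=0.3)

peak_careful = max(careful)  # nearly reaches the target of 1.0
peak_casual = max(casual)    # redirected after 3 frames, well short of 1.0
print(round(peak_careful, 3), round(peak_casual, 3))
```

Under these assumptions the fully articulated trajectory comes within a fraction of a percent of its target, while the shortened one is redirected at roughly two thirds of the way there, so features observed in the casual case fall in a region that the fully articulated training material seldom visits.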
A similar problem occurs with hyper-articulated speech. In hyper-articulated speech, the speaker exerts an extra effort to make the different sounds of their speech distinguishable. This extra effort can include changing the sounds of certain phonetic units so that they are more distinguishable from similar sounding phonetic units, holding the sounds of certain phonetic units longer, or transitioning between sounds more abruptly so that each sound is perceived as being distinct from its neighbors. Each of these mechanisms makes it more difficult to recognize the speech using an HMM system because each technique results in a set of feature vectors for the speech signal that do not match well to the feature vectors present in the training data.
HMM systems also have trouble dealing with changes in the rate at which people speak. Thus, if someone speaks slower or faster than the training signal, the HMM system will tend to make more errors decoding the speech signal.
Alternatives to HMM systems have been proposed. In particular, it has been proposed that the trajectory or behavior of a production-related parameter of the speech signal should be modeled directly. However, none of the proposals have completely modeled the dynamic aspects of speech. In particular, the models have not addressed the time-dependent change in the trajectory that occurs as the speaker approaches a desired target for a phonetic unit. In addition, the models have not provided a decoding means that allows for a probability determination based on continuous values for the trajectory while limiting the search space to a manageable number of trajectory states.
In light of this, a speech recognition framework is needed that explicitly models the production-related behavior of speech in terms of other model variables such that the dynamic aspects of speech trajectory are better modeled and decoding is manageable.