A pattern recognition system, such as a speech recognition system, takes an input signal and attempts to decode the signal to find a pattern represented by the signal. For example, in a speech recognition system, a speech signal is received by the recognition system and is decoded to identify a string of words represented by the speech signal.
Many speech recognition systems utilize Hidden Markov Models (HMMs) in which phonetic units are represented by a single tier of connected states. Using a training signal, probability distributions for occupying the states and for transitioning between states are determined for each of the phonetic units. To decode a speech signal, the signal is divided into frames and each frame is transformed into a feature vector. The feature vectors are then compared to the distributions for the states to identify the most likely sequence of HMM states that could have produced the frames. The phonetic unit that corresponds to that sequence is then selected.
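The decoding step described above can be illustrated with a minimal sketch of a Viterbi search over HMM states. The sketch assumes diagonal Gaussian state distributions and one-dimensional feature vectors; the function names, the two-state left-to-right topology, and all numeric values are hypothetical and are chosen only for illustration, not taken from any particular recognizer.

```python
import math

NEG_INF = float("-inf")

def gaussian_log_likelihood(x, mean, var):
    # Log-density of feature vector x under a diagonal Gaussian (assumed form).
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def viterbi(log_emissions, log_trans, log_init):
    # log_emissions[t][s]: log-likelihood of frame t under state s.
    # log_trans[p][s]: log-probability of moving from state p to state s.
    # log_init[s]: log-probability of starting in state s.
    n_states = len(log_init)
    score = [log_init[s] + log_emissions[0][s] for s in range(n_states)]
    back = []  # back[t-1][s] = best predecessor of state s at frame t
    for t in range(1, len(log_emissions)):
        new_score, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda p: score[p] + log_trans[p][s])
            new_score.append(score[best_prev] + log_trans[best_prev][s]
                             + log_emissions[t][s])
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

# Hypothetical two-state, left-to-right model and four feature vectors.
means, variances = [[0.0], [5.0]], [[1.0], [1.0]]
frames = [[0.1], [0.2], [4.9], [5.1]]
log_emissions = [
    [gaussian_log_likelihood(f, m, v) for m, v in zip(means, variances)]
    for f in frames
]
log_trans = [[math.log(0.5), math.log(0.5)], [NEG_INF, 0.0]]
log_init = [0.0, NEG_INF]
best_path = viterbi(log_emissions, log_trans, log_init)
```

In this sketch, each state's score depends only on the current frame and the preceding state, which is the property that the remainder of this background identifies as limiting when the dynamics of articulation change.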
Although HMM-based recognition systems perform well in many relatively simple speech recognition tasks, they do not directly model some important dynamic aspects of speech, and they are known to perform poorly on difficult tasks such as conversational speech. As a result, they are not able to accommodate dynamic articulation differences between the speech signals used for training and the speech signal being decoded. For example, in casual speaking settings, speakers tend to hypo-articulate, or under-articulate, their speech. This means that the trajectory of the speaker's articulation may not reach its intended target before it is redirected to a next target. Because the training signals are typically formed using a “reading” style of speech, in which the speaker articulates more fully than in hypo-articulated speech, the hypo-articulated speech does not match the trained HMM states. As a result, the recognizer provides less than ideal recognition results for casual speech.
A similar problem occurs with hyper-articulated speech. In hyper-articulated speech, which often occurs in noisy environments, the speaker exerts an extra effort to make the different sounds of their speech distinguishable. This extra effort can include changing the sounds of certain phonetic units so that they are more distinguishable from similar-sounding phonetic units, holding the sounds of certain phonetic units longer, or transitioning between sounds more abruptly so that each sound is perceived as being distinct from its neighbors. Each of these mechanisms makes it more difficult to recognize the speech using an HMM system because each results in a set of feature vectors for the speech signal that does not match well with the feature vectors present in the training data.
HMM systems also have trouble dealing with changes in the rate at which people speak. Thus, if someone speaks more slowly or more quickly than the speakers in the training signal, the HMM system will tend to make more errors decoding the speech signal.
Alternatives to HMM systems have been proposed. In particular, it has been proposed that the trajectory or articulatory behavior of the speech signal should be modeled directly. One prior system provides a framework for explicitly modeling the articulatory behavior of speech. That system identifies an articulatory dynamics value by performing a linear interpolation between a value at a previous time and an articulatory target. The articulatory dynamics value is then used to form a predicted acoustic feature value, which is compared with the observed acoustic feature value to determine the likelihood that the observed value was produced by a corresponding phonological unit. However, that system represented the hidden dynamic variable as a continuously varying variable, which makes parameter training and decoding very difficult. Although another prior system represented the hidden dynamic variable as a discretely varying variable to reduce this difficulty, it explored only first-order dynamics.
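The linear interpolation between a previous value and an articulatory target can be illustrated with a minimal sketch of a first-order, target-directed recursion. The function name, the interpolation weight `rate`, and the numeric values are hypothetical; the sketch assumes a scalar dynamics value and a fixed weight, and is not represented to be the formulation used by either prior system.

```python
def target_directed_trajectory(start, target, rate, n_frames):
    # At each frame, the new value is a weighted average of the previous
    # value and the target; rate (between 0 and 1) weights the previous value.
    values = []
    z = start
    for _ in range(n_frames):
        z = rate * z + (1.0 - rate) * target
        values.append(z)
    return values

# Same start, target, and weight; only the number of frames differs.
long_segment = target_directed_trajectory(0.0, 1.0, 0.8, 20)
short_segment = target_directed_trajectory(0.0, 1.0, 0.8, 3)

# Distance remaining between the final value and the target in each case.
long_undershoot = 1.0 - long_segment[-1]
short_undershoot = 1.0 - short_segment[-1]
```

Under these assumptions, the short segment ends well short of the target while the long segment nearly reaches it, which corresponds to the target undershoot described above for hypo-articulated speech.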
The discussion above is merely provided for general background information and is not intended to be used as an aid in determining the scope of the claimed subject matter.